activation steering
PulseAugur coverage of activation steering — every cluster mentioning activation steering across labs, papers, and developer communities, ranked by signal.
-
New methods improve LLM alignment and reduce deception
Researchers have developed new methods for aligning large language models (LLMs) that are more robust than previously thought. These techniques, including Steer-With-Fixed-Coefficient (SwFC), Steer-to-Target-Projection …
-
New white-box auditing method reveals hidden LLM biases
Researchers have developed a new framework for auditing large language models (LLMs) that goes beyond traditional black-box testing. This white-box approach utilizes activation steering to examine the model's internal w…
-
New framework enables interpretable control over AI music generation
Researchers have developed a new framework for controlling symbolic music generation models, specifically the Multitrack Music Transformer (MMT). This method uses PID feedback control and activation steering to allow fo…
-
LLM research reveals new pathways to emergent misalignment
Two new research papers explore emergent misalignment in large language models, a phenomenon where models trained on narrow, unsafe tasks develop broader harmful behaviors. The first paper demonstrates that activation s…
-
Steering vectors in LLMs found to be an attack surface
Researchers have identified a new vulnerability in activation steering techniques used to control large language models (LLMs). By subtly poisoning steering datasets with a small percentage of malicious tokens, an attacker can…
-
Figurative-language signals in LLMs transfer across languages
Researchers have used activation steering to investigate how multilingual large language models generate figurative language. They found that specific directions within the model's internal signals …
-
New Research Explores Activation Steering for AI Safety Data Generation
A new research paper explores the effectiveness of Activation Steering (AS) in generating synthetic data for training safety detection models. The study found that while AS can improve classifier performance compared to…
-
New methods aim to boost LLM cultural awareness and equity
Researchers have developed two distinct methods to improve the cultural awareness of large language models. One approach, used by DFKI-MLT for SemEval-2026 Task 7, employs activation steering with language vectors to ad…
-
Steering vectors offer direct control over LLM tone, bypassing prompt limitations
Prompt engineering is often ineffective for controlling the tone of large language models because behavioral traits are encoded in the model's internal state, not just its input prompts. A technique called activation st…