Language model interpretability: Detection and steering are not aligned

By PulseAugur Editorial · [1 source] · 2026-06-25 04:00

Researchers examined whether knowing where a behavior is represented in a language model is enough to steer that behavior. They found that the direction used to detect a behavior, such as hallucination, differs from the direction used to control it, with a significant geometric gap observed across multiple models and scales. This dissociation between detection and steering appears to originate during pretraining and is not altered by instruction tuning. A small rotation toward the steering direction can improve control, but the study suggests that detection is a high-dimensional phenomenon and that the simple geometric angle is not a reliable predictor of steerability.
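To make the geometric language concrete, the sketch below measures the angle between a detection direction and a steering direction and rotates one a small fraction of the way toward the other. It is an illustration only, not the paper's method: the helper names (`angle_degrees`, `rotate_toward`), the dimension `d_model`, and the random vectors standing in for real probe and steering directions are all assumptions.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def angle_degrees(a, b):
    """Angle between two directions, in degrees."""
    cos = float(np.clip(np.dot(unit(a), unit(b)), -1.0, 1.0))
    return float(np.degrees(np.arccos(cos)))


def rotate_toward(a, b, fraction):
    """Rotate unit(a) toward unit(b) by `fraction` of the angle between them.

    Uses spherical interpolation, so the result stays on the unit sphere.
    Not intended for directions that are exactly opposite.
    """
    a, b = unit(a), unit(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-12:  # directions already coincide
        return a
    return (np.sin((1 - fraction) * omega) * a
            + np.sin(fraction * omega) * b) / np.sin(omega)


# Hypothetical stand-ins: in practice these would come from a trained
# probe (detection) and from a steering-vector procedure (control).
rng = np.random.default_rng(0)
d_model = 512
detect_dir = rng.normal(size=d_model)
steer_dir = rng.normal(size=d_model)

gap = angle_degrees(detect_dir, steer_dir)
rotated = rotate_toward(detect_dir, steer_dir, fraction=0.1)

print(f"angle between detection and steering directions: {gap:.1f} degrees")
print(f"angle after a 10% rotation: {angle_degrees(rotated, steer_dir):.1f} degrees")
```

Independent random vectors in high dimensions are nearly orthogonal, so this toy gap carries no information about any real model. As the summary notes, the angle alone is reported to be an unreliable predictor of steerability, so a measurement like this is a starting point rather than a diagnostic.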

IMPACT Reveals a fundamental disconnect between understanding and controlling language model behaviors, potentially impacting future interpretability and alignment research.

RANK_REASON Academic paper detailing a new finding in mechanistic interpretability.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Cosimo Galeone, Anna Ettorre, Minsu Park, Giuseppe Ettorre, Daniele Ligorio · 2026-06-25 04:00

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

arXiv:2606.24952v1. Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which…
