PulseAugur

Language model interpretability: Detection and steering are not aligned

Researchers investigated whether knowing where a behavior is represented in a language model's activations is enough to steer that behavior. They found that the direction used to detect a behavior, such as hallucination, differs from the direction used to control it, with a significant geometric gap observed across multiple models and scales. This dissociation between detection and steering appears to originate during pretraining and is not altered by instruction tuning. A small rotation toward the steering direction can improve control, but the study suggests that detection is a high-dimensional phenomenon and that the angle between the two directions is not, on its own, a reliable predictor of steerability.
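To make the geometric language concrete, here is a minimal sketch of what "the angle between a detection direction and a steering direction" and "a small rotation toward the steering direction" can mean for plain vectors. It is an illustration only, not the paper's method: the 4-dimensional vectors are made up, and the helper names (`angle_deg`, `rotate_toward`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return [a / n for a in v]

def angle_deg(u, v):
    """Angle between two directions, in degrees."""
    c = dot(normalize(u), normalize(v))
    c = max(-1.0, min(1.0, c))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(c))

def rotate_toward(src, dst, degrees):
    """Rotate the unit vector along src by `degrees` toward dst,
    staying inside the plane spanned by src and dst."""
    s = normalize(src)
    d = normalize(dst)
    # Part of dst that is orthogonal to src (one Gram-Schmidt step).
    proj = dot(d, s)
    ortho = [di - proj * si for di, si in zip(d, s)]
    if math.sqrt(dot(ortho, ortho)) < 1e-12:
        return s  # parallel or anti-parallel: no unique rotation plane
    e = normalize(ortho)
    t = math.radians(degrees)
    return [math.cos(t) * si + math.sin(t) * ei for si, ei in zip(s, e)]

# Toy, made-up directions (assumed for illustration, not values from the paper).
detect_dir = [1.0, 0.0, 0.0, 0.0]
steer_dir = [0.0, 1.0, 0.0, 0.0]

gap = angle_deg(detect_dir, steer_dir)            # gap before rotating
rotated = rotate_toward(detect_dir, steer_dir, 10.0)
gap_after = angle_deg(rotated, steer_dir)         # gap after a 10-degree rotation
print(gap, gap_after)
```

In this toy setup the rotation shrinks the angle to the steering direction by the requested amount while keeping the vector at unit length. The summary's point is that in real models this angle alone does not reliably predict how well steering works.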

IMPACT Reveals a fundamental disconnect between understanding and controlling language model behaviors, potentially impacting future interpretability and alignment research.

RANK_REASON Academic paper detailing a new finding in mechanistic interpretability.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Cosimo Galeone, Anna Ettorre, Minsu Park, Giuseppe Ettorre, Daniele Ligorio

    Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

    arXiv:2606.24952v1 Announce Type: new Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which…