Researchers investigated whether knowing how a language model represents a behavior is enough to steer that behavior. They found that the direction used to detect a behavior, such as hallucination, differs from the direction used to control it, with a significant geometric gap between the two across multiple models and scales. This dissociation between detection and steering appears to originate during pretraining and is not altered by instruction tuning. A small rotation toward the steering direction can improve control, but the study suggests that detection is a high-dimensional phenomenon and that the angle between the two directions is not a reliable predictor of steerability.
AI IMPACT Reveals a fundamental disconnect between understanding and controlling language model behaviors, which may affect future interpretability and alignment research.
RANK_REASON Academic paper detailing a new finding in mechanistic interpretability.
AI-generated summary · Google Gemini · from 1 source.
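
To make the geometric language in the summary concrete, here is a minimal sketch of what "the angle between a detection direction and a steering direction" and "a small rotation toward the steering direction" could mean for plain vectors. This is not the paper's method or code: the vectors `detect` and `steer`, the helper names, and the 10-degree step are all hypothetical choices made for illustration, and a real experiment would use directions extracted from model activations.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def unit(u):
    """Scale a nonzero vector to unit length."""
    n = norm(u)
    return [a / n for a in u]

def angle_deg(u, v):
    """Angle between two nonzero vectors, in degrees."""
    c = dot(unit(u), unit(v))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(c))

def rotate_toward(u, v, step_deg):
    """Rotate unit(u) by step_deg toward v, inside the plane spanned by u and v."""
    u_hat = unit(u)
    # Part of v orthogonal to u (one Gram-Schmidt step).
    proj = dot(v, u_hat)
    w = [b - proj * a for a, b in zip(u_hat, v)]
    if norm(w) < 1e-12:
        # u and v are (anti)parallel: no unique rotation plane, leave u unchanged.
        return u_hat
    w_hat = unit(w)
    t = math.radians(step_deg)
    return [math.cos(t) * a + math.sin(t) * b for a, b in zip(u_hat, w_hat)]

# Hypothetical toy directions in a 4-dimensional space (assumption, not real data).
detect = [1.0, 0.0, 0.0, 0.0]
steer = [0.2, 1.0, 0.0, 0.0]

before = angle_deg(detect, steer)
rotated = rotate_toward(detect, steer, 10.0)
after = angle_deg(rotated, steer)
print(f"angle before: {before:.2f} deg, after a 10 deg rotation: {after:.2f} deg")
```

The sketch only measures and changes an angle. Whether a smaller angle actually improves control is an empirical question, and the summary reports that the angle alone is not a reliable predictor of steerability.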