Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Language model interpretability: detection and steering do not line up

By the PulseAugur editorial team · [1 source] · 2026-06-25 04:00

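Researchers investigated the relationship between "knowing" where a behavior is represented in a language model and being able to "steer" that behavior. They found that the direction used to detect a behavior (such as hallucination) is not the same as the direction used to control it, and they observed a significant geometric gap across multiple models and scales. This separation between detection and steering appears to originate in pretraining and is not changed by instruction tuning. Although small rotations toward the steering direction can improve control, the study indicates that detection is a high-dimensional phenomenon and that a simple geometric angle is not a reliable predictor of steerability.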
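To make the geometric claim concrete, here is a minimal sketch of measuring the angle between a detection direction and a steering direction, and of rotating one toward the other by a small angle. It is not the paper's method: the vectors are random stand-ins, and `detect_dir`, `steer_dir`, the hidden size of 512, and the 10-degree rotation are all assumptions chosen for illustration.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def angle_deg(a, b):
    """Angle in degrees between two vectors."""
    cos = np.clip(np.dot(unit(a), unit(b)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))


def rotate_toward(a, b, theta_deg):
    """Rotate unit(a) by theta_deg toward b, inside the plane spanned by a and b."""
    a_hat = unit(a)
    # Component of b orthogonal to a (one Gram-Schmidt step).
    ortho_hat = unit(b - np.dot(b, a_hat) * a_hat)
    theta = np.radians(theta_deg)
    return np.cos(theta) * a_hat + np.sin(theta) * ortho_hat


# Hypothetical stand-ins: in practice these would come from a probe
# (detection) and from a steering-vector procedure (control).
rng = np.random.default_rng(0)
hidden_size = 512  # assumed, not taken from the paper
detect_dir = rng.standard_normal(hidden_size)
steer_dir = rng.standard_normal(hidden_size)

gap = angle_deg(detect_dir, steer_dir)
rotated = rotate_toward(detect_dir, steer_dir, 10.0)
new_gap = angle_deg(rotated, steer_dir)

print(f"angle before rotation: {gap:.1f} degrees")
print(f"angle after rotating 10 degrees toward steer_dir: {new_gap:.1f} degrees")
```

The rotation stays in the two-dimensional plane spanned by the two directions, so it shrinks the angle by exactly the requested amount. As the summary notes, that angle alone is not a reliable predictor of steerability, so a smaller gap here should not be read as a guarantee of better control.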

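Impact: The findings point to a fundamental disconnect between understanding and controlling language model behavior, which may affect future interpretability and alignment research.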

Ranking rationale: An academic paper detailing new findings in mechanistic interpretability.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Cosimo Galeone, Anna Ettorre, Minsu Park, Giuseppe Ettorre, Daniele Ligorio · 2026-06-25 04:00

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

arXiv:2606.24952v1. Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which…

