Mechanisms of Introspective Awareness

Large language models develop introspective awareness through circuits that emerge in post-training

By the PulseAugur editorial team · [1 source] · 2026-06-11 04:00

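Researchers have identified a two-stage circuit in large language models that lets them detect when an external steering vector has been injected. This introspective awareness emerges during post-training, particularly during preference optimization, and is absent in base models. The study indicates that the capability is substantially under-utilized and could be strengthened in future models by improving the detection mechanism and reducing refusal behavior.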

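Impact: Reveals the underlying mechanisms of self-awareness in large language models, pointing to potential gains in the safety and controllability of future models.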

Ranking rationale: This cluster contains an academic paper that details mechanistic research findings on large language models.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey · 2026-06-11 04:00

Mechanisms of Introspective Awareness

arXiv:2603.21396v5 Announce Type: replace Abstract: Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms…
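
For readers unfamiliar with the term, the following is a minimal toy sketch of what "injecting a steering vector into the residual stream" means: a fixed direction, scaled by a strength, is added to the hidden state at every token position. It is not the paper's code and does not reproduce the two-stage circuit described above. The function names (`inject`, `mean_projection`), the dimensions, and the strength `alpha` are all hypothetical, and the projection check is only an external measurement for illustration, not a claim about how a model detects the injection internally.

```python
import math
import random


def inject(residual, steering, alpha):
    """Add alpha * steering to every token position of a toy residual stream."""
    return [[h + alpha * s for h, s in zip(row, steering)] for row in residual]


def mean_projection(residual, direction):
    """Average dot product of each position with the unit-normalised direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    total = sum(sum(h * u for h, u in zip(row, unit)) for row in residual)
    return total / len(residual)


# Hypothetical toy sizes; real models have far larger hidden dimensions.
rng = random.Random(0)
d_model, n_tokens = 16, 8

# Random stand-ins for hidden states and for a concept direction.
residual = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
steering = [rng.gauss(0, 1) for _ in range(d_model)]

alpha = 4.0  # assumed injection strength
steered = inject(residual, steering, alpha)

baseline_score = mean_projection(residual, steering)
steered_score = mean_projection(steered, steering)

# The injection shifts the mean projection by alpha * ||steering||.
shift = steered_score - baseline_score
print(f"baseline={baseline_score:.3f} steered={steered_score:.3f} shift={shift:.3f}")
```

In this toy setting the shift in mean projection equals `alpha` times the norm of the steering vector, which is why stronger injections are easier to measure from outside the model. Whether and how a model reports such an injection itself is the question the paper investigates.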