Mechanisms of Introspective Awareness
Researchers have identified a two-stage circuit in large language models that lets them detect when an external steering vector has been injected into their activations. This introspective awareness emerges during post-training, particularly preference optimization, and is absent in base models. The study suggests the capability is significantly underutilized and could be amplified in future models by improving detection mechanisms and reducing refusal behaviors.
AI IMPACT Identifies mechanisms underlying LLM introspective awareness, suggesting potential for improved safety and control in future models.
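
For readers unfamiliar with the term, the sketch below illustrates what "injecting a steering vector" usually means: a fixed direction is added to a hidden activation. It is a toy illustration, not the circuit or the method from the study. The function names, the 64-dimensional random vectors, and the strength value are all hypothetical, and the projection readout assumes the direction is already known, whereas the models in the study are reported to detect the injection without being told.

```python
import math
import random


def unit_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def add_steering(hidden, direction, strength):
    """Return hidden + strength * (direction / |direction|).

    Hypothetical helper: stands in for adding a steering vector
    to one activation vector inside a model.
    """
    norm = unit_norm(direction)
    return [h + strength * d / norm for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of hidden onto the unit steering direction."""
    norm = unit_norm(direction)
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy data (assumption): random vectors in place of real activations.
rng = random.Random(0)
dim = 64
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]

steered = add_steering(hidden, direction, strength=8.0)

# The projection onto the steering direction moves by the injected
# strength, which is what makes the injection detectable in principle.
shift = projection(steered, direction) - projection(hidden, direction)
print(f"projection shift: {shift:.3f}")
```

Because the added term is exactly `strength` times the unit direction, the projection shifts by the injected strength regardless of the original activation. In a real model the direction and the layer would have to be chosen empirically, and detection would need to work without access to the direction.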