Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Researchers have identified a shared activation direction within language models that can be used to detect and mitigate emergent misalignment, particularly in models fine-tuned on insecure code. Within individual models, the direction was found to be causally specific and actionable, and using it significantly reduced code spillover. Cross-architecture transfer of these directions proved less specific: an asymmetric topology emerged in which Gemma and Qwen models acted as 'donors' and Llama models as 'receivers', highlighting the limits of linear correction across architectures.
AI IMPACT: Identifies a method for improving AI safety and reliability by detecting and mitigating emergent misalignment in language models.
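
To make the idea of an "activation direction" concrete, here is a minimal sketch of how such a direction might be used for detection and mitigation. The summary does not say how the researchers extracted or applied their direction, so everything here is an assumption: the difference-of-means estimate, the projection score, the projection-removal step, the synthetic activations, and all function names are hypothetical illustrations of a common interpretability recipe, not the authors' method.

```python
import numpy as np


def mean_difference_direction(acts_misaligned, acts_baseline):
    """Unit vector from the baseline mean to the misaligned mean.

    Assumption: difference of means is one common way to estimate a
    direction; the source does not state which method was used.
    """
    diff = acts_misaligned.mean(axis=0) - acts_baseline.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projection_score(acts, direction):
    """Detection signal: scalar projection of each activation onto the direction."""
    return acts @ direction


def ablate_direction(acts, direction):
    """Mitigation sketch: remove each activation's component along the direction."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 64
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

baseline = rng.normal(size=(200, d_model))
misaligned = rng.normal(size=(200, d_model)) + 4.0 * planted

direction = mean_difference_direction(misaligned, baseline)
scores_misaligned = projection_score(misaligned, direction)
scores_baseline = projection_score(baseline, direction)
corrected = ablate_direction(misaligned, direction)
```

In this toy setup, a higher projection score flags activations that lean toward the estimated direction, and the ablated activations have no remaining component along it. A direction estimated in one model's activation space would need some mapping before it could be applied to another architecture, which is where the reported limits of linear correction would come into play.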