SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior
A new research paper demonstrates that interventions designed to suppress undesirable behaviors in AI models by manipulating Sparse Autoencoder (SAE) features are unreliable. The study shows that even when specific SAE features are clamped, the AI model can recover the suppressed behavior through alternative pathways in the residual space. This finding highlights a critical gap between controlling individual features and ensuring complete behavioral control, particularly in safety-critical applications like refusal steering.
AI IMPACT: Reveals limitations in current AI safety interventions, suggesting a need for more robust control mechanisms beyond feature manipulation.
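
To make the gap between feature control and behavioral control concrete, here is a minimal toy sketch in Python. It is not the paper's method or setup: the 4-dimensional activation, the two-feature SAE, the `clamp_feature` helper, and the linear `w_behavior` readout are all hypothetical, chosen only to show how a behavior signal can survive when it also lives in a direction the clamped feature does not cover.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, k, value=0.0):
    """Clamp SAE feature k to `value` by editing the activation x.

    Only the k-th feature's decoder contribution is changed; everything
    else in x (other features, SAE reconstruction error) is left as is.
    """
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU feature activations
    delta = (value - f[k]) * W_dec[:, k]     # change in feature k's contribution
    return x + delta, f

# Hypothetical toy SAE: 4-dim activation, 2 features along axes 0 and 1.
d = 4
basis = np.eye(d)
W_dec = np.stack([basis[0], basis[1]], axis=1)  # shape (4, 2), decoder columns
W_enc = W_dec.T                                 # shape (2, 4), tied encoder
b_enc = np.zeros(2)

# Assumed behavior readout: reads axis 0 (covered by feature 0)
# and axis 2 (covered by no SAE feature at all).
w_behavior = np.array([1.0, 0.0, 1.0, 0.0])

x = np.array([2.0, 0.5, 1.5, 0.0])
x_clamped, f = clamp_feature(x, W_enc, b_enc, W_dec, k=0)

before = float(w_behavior @ x)          # behavior score before the intervention
after = float(w_behavior @ x_clamped)   # behavior score after clamping feature 0
f_after = np.maximum(W_enc @ x_clamped + b_enc, 0.0)

print("feature 0 before/after:", f[0], f_after[0])
print("behavior score before/after:", before, after)
```

In this toy, feature 0 reads as fully suppressed after clamping, yet the behavior score stays nonzero because part of the signal sits on an axis the SAE does not represent. Real models are nonlinear and the recovery described in the paper may work differently, so treat this only as an intuition aid.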