Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers
Researchers report evidence that common methods for assigning specific roles to attention heads in transformer models are insufficient. In a study of three instruction-tuned models, heads identified as crucial for a behavior often failed to transfer that behavior to different prompts. To address this, the authors propose a framework called KID (Knowing / Intent / Doing) and a three-stage pipeline intended to assign roles to attention heads more accurately.
IMPACT: Challenges current interpretability methods, potentially leading to a more robust understanding of transformer model behaviors.
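
The gap between "crucial under ablation" and "transfers to other prompts" can be illustrated with a toy sketch. This is not the study's pipeline or the KID framework: the two "heads", the multiplicative score, and every name below (`head_outputs`, `behavior_score`, `run`) are hypothetical, chosen only to show how a head can be necessary on one prompt without being sufficient on another.

```python
# Toy illustration only: two hypothetical "heads" and a made-up behavior score.
# Nothing here reproduces the study's models, metrics, or pipeline.

def head_outputs(prompt):
    """Return toy activations for two hypothetical heads on a prompt.

    A prompt is modeled as a (signal, gate) pair: head 0 carries the
    candidate signal, head 1 acts as a prompt-dependent gate.
    """
    signal, gate = prompt
    return [signal, gate]


def behavior_score(heads):
    """Assumed readout: the behavior appears only when both heads are active."""
    return heads[0] * heads[1]


def run(prompt, patch=None):
    """Run the toy model, optionally overwriting one head's activation.

    patch is an (index, value) pair: value 0.0 ablates the head, while a
    value cached from another prompt performs an activation patch.
    """
    heads = head_outputs(prompt)
    if patch is not None:
        index, value = patch
        heads[index] = value
    return behavior_score(heads)


source = (1.0, 1.0)  # prompt on which the behavior appears
target = (0.0, 0.0)  # different prompt, behavior absent

cached_head0 = head_outputs(source)[0]

baseline = run(source)                            # behavior present
ablated = run(source, patch=(0, 0.0))             # ablating head 0 removes it
restored = run(source, patch=(0, cached_head0))   # restoring head 0 reverses the ablation
transferred = run(target, patch=(0, cached_head0))  # patching into the target prompt

print("ablation effect:", baseline - ablated)
print("restored equals baseline:", restored == baseline)
print("transfer effect:", transferred - run(target))
```

In this toy setting, head 0 looks essential when judged by ablation and restoration on the source prompt, yet patching its activation into the target prompt changes nothing, because the assumed readout also depends on head 1. A role claim based on ablation alone would therefore overstate what head 0 does.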