Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes
Researchers propose a method for discovering circuits in large language models: cluster attention heads by their co-activation statistics, then use causal ablation to confirm whether each candidate group actually functions as a circuit. The approach, termed "closure-validated circuit discovery," was tested on models including Pythia 1B and OLMo 1B, where it identified statistically significant circuits; it showed limitations on Mixture-of-Experts models.
AI IMPACT: This research offers a more rigorous method for understanding internal LLM mechanisms, potentially improving safety and reliability.
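
The propose-then-validate idea summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' procedure: the activations are synthetic, the "model" is a linear readout, clustering is done by connected components of a thresholded correlation graph (a stand-in for whatever clustering the paper uses), and significance is estimated by comparing against random head sets of the same size. The names `make_toy_data`, `coactivation_clusters`, `ablation_effect`, and `ablation_p_value` are hypothetical.

```python
import numpy as np


def make_toy_data(n_samples=500, n_heads=16, planted=(2, 5, 11), seed=0):
    """Synthetic per-head activations with one planted group of co-active heads."""
    rng = np.random.default_rng(seed)
    acts = rng.normal(size=(n_samples, n_heads))
    shared = rng.normal(size=n_samples)
    for h in planted:
        # Planted heads share a latent signal, so they co-activate.
        acts[:, h] = shared + 0.3 * rng.normal(size=n_samples)
    # Toy linear readout: planted heads matter much more than the rest.
    readout = np.full(n_heads, 0.05)
    readout[list(planted)] = 1.0
    return acts, readout


def coactivation_clusters(acts, threshold=0.5):
    """Propose candidate groups: connected components of the graph whose
    edges join heads with |Pearson correlation| >= threshold."""
    corr = np.corrcoef(acts, rowvar=False)
    adj = np.abs(corr) >= threshold
    seen, clusters = set(), []
    for start in range(corr.shape[0]):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            h = stack.pop()
            comp.append(h)
            for nb in np.flatnonzero(adj[h]):
                nb = int(nb)
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if len(comp) >= 2:  # singletons are not candidate circuits
            clusters.append(sorted(comp))
    return clusters


def ablation_effect(acts, readout, heads):
    """Mean absolute change in the toy output when `heads` are zero-ablated."""
    ablated = acts.copy()
    ablated[:, heads] = 0.0
    return float(np.mean(np.abs(acts @ readout - ablated @ readout)))


def ablation_p_value(acts, readout, heads, n_random=200, seed=0):
    """Validate a candidate: fraction of random same-size head sets whose
    ablation effect is at least as large (with add-one smoothing)."""
    rng = np.random.default_rng(seed)
    observed = ablation_effect(acts, readout, heads)
    n_heads = acts.shape[1]
    hits = 0
    for _ in range(n_random):
        random_heads = rng.choice(n_heads, size=len(heads), replace=False)
        if ablation_effect(acts, readout, random_heads) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_random)


acts, readout = make_toy_data()
for cluster in coactivation_clusters(acts):
    print(cluster, ablation_p_value(acts, readout, cluster))
```

In a real setting the activations would come from forward passes of the model under study, and the ablation effect would be measured on a task metric such as loss or logit difference; co-activation alone only proposes a group, and the ablation comparison is what accepts or rejects it.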