Demystifying Variance in Circuit Discovery of LLMs
A new research paper published on arXiv explores the variability in circuit discovery methods for Large Language Models (LLMs). The study identifies three main sources of variance: resampling, rephrasing, and sample-wise variance. The authors introduce CEAP, a new method that improves upon the existing EAP-IG technique by reducing resampling variance. They also suggest that rephrasing variance indicates LLMs may be inherently difficult to steer due to the diverse ways prompts can activate different internal circuits. Sample-wise variance, they argue, is largely benign and related to the definition of unfaithfulness rather than circuit defects.
IMPACT: Introduces a new method to improve LLM interpretability and control, potentially aiding in understanding and steering model behavior.