Brief · PulseAugur

TOOL · arXiv cs.AI

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Researchers have developed a method to improve the accuracy of attribution patching, a technique for estimating how much each part of a language model contributes to its behavior. Standard attribution patching is a first-order approximation, so its estimates can be unreliable where the network is strongly non-linear. The new approach adds a second-order correction computed with Hessian-vector products, which improves the fidelity of circuit recovery. The method is reported to remain computationally feasible for larger models, and it comes with practical tools for detecting untrustworthy estimates and quantifying their error.
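To make the idea concrete, here is a minimal sketch of a first-order patching estimate and a second-order correction on a toy problem. Everything in it is an assumption for illustration: the metric `w · tanh(a)`, the function names, and the sample values are hypothetical, and a finite-difference Hessian-vector product stands in for the autodiff-based one a real implementation would likely use. It is not the paper's method, only the Taylor-expansion idea the summary describes.

```python
import numpy as np

# Toy "model metric": a non-linear function of one activation vector.
# Assumption: stands in for a real model's logit difference or loss.
def metric(a, w):
    return float(np.dot(w, np.tanh(a)))

def metric_grad(a, w):
    # Analytic gradient of w . tanh(a) with respect to a.
    return w * (1.0 - np.tanh(a) ** 2)

def hvp_fd(a, w, v, eps=1e-4):
    # Hessian-vector product H(a) @ v via a central difference of gradients.
    # The full Hessian is never formed.
    return (metric_grad(a + eps * v, w) - metric_grad(a - eps * v, w)) / (2.0 * eps)

def patching_estimates(a_clean, a_corrupt, w):
    delta = a_corrupt - a_clean
    # Ground truth: actually swap in the corrupted activation.
    exact = metric(a_corrupt, w) - metric(a_clean, w)
    # First-order estimate: gradient at the clean activation times delta.
    first = float(np.dot(metric_grad(a_clean, w), delta))
    # Second-order estimate: add 0.5 * delta^T H delta via one HVP.
    second = first + 0.5 * float(np.dot(delta, hvp_fd(a_clean, w, delta)))
    return exact, first, second

# Hypothetical sample values.
w = np.array([1.0, -2.0, 0.5])
a_clean = np.array([0.5, -0.3, 0.8])
a_corrupt = a_clean + np.array([0.3, 0.2, -0.25])

exact, first, second = patching_estimates(a_clean, a_corrupt, w)
print(f"exact effect:          {exact:.6f}")
print(f"first-order estimate:  {first:.6f}")
print(f"second-order estimate: {second:.6f}")
```

In this toy case the second-order estimate lands closer to the exact patching effect than the first-order one. The size of the correction term relative to the first-order estimate could also serve as a rough warning sign that the linear estimate is not trustworthy, though whether the paper's diagnostic works this way is not stated in the summary.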

IMPACT Improves interpretability of AI models, enabling more reliable circuit identification and debugging.

Integrated Gradients
Language Model
Hessian-vector product
Attribution Patching