Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation
Researchers have developed a technique called Gradient Interaction Modifications (GIM) to improve the accuracy of circuit localization in large language models. Existing gradient-based methods often fail to account for interactions between model components, so they misestimate the importance of those components. GIM addresses this by explicitly accounting for these interactions during backpropagation, particularly in attention mechanisms, where softmax redistribution can cause gradients to vanish. The method demonstrates state-of-the-art performance on benchmark tasks and enables more faithful mechanistic analysis of LLMs.
IMPACT: Enhances interpretability of LLMs, potentially leading to more robust safety and alignment research.
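
The vanishing-gradient issue mentioned in the summary can be seen in a toy example. The sketch below is not GIM and does not implement its correction. It only illustrates, under assumed numbers, how softmax redistribution can hide the importance of attention scores from a plain gradient. The function names, the three-key setup, and the score and value vectors are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def attention_output(scores, values):
    # Scalar attention output: attention weights times (scalar) values.
    return float(softmax(scores) @ values)

def score_gradient(scores, values):
    # Analytic gradient of the output w.r.t. each score:
    # d out / d s_i = p_i * (v_i - out)
    p = softmax(scores)
    out = p @ values
    return p * (values - out)

# Hypothetical setup: keys 0 and 1 both carry the signal (value 1.0)
# and both receive a large score; key 2 is a distractor.
scores = np.array([10.0, 10.0, 0.0])
values = np.array([1.0, 1.0, 0.0])

grad = score_gradient(scores, values)
base = attention_output(scores, values)

# Knock out key 0 only: softmax shifts its weight onto key 1.
ablate_one = attention_output(np.array([0.0, 10.0, 0.0]), values)
# Knock out keys 0 and 1 together: the signal is actually lost.
ablate_both = attention_output(np.array([0.0, 0.0, 0.0]), values)

print("gradient w.r.t. scores:", grad)
print("effect of ablating key 0 alone:", base - ablate_one)
print("effect of ablating keys 0 and 1:", base - ablate_both)
```

In this setup the gradient with respect to each signal-carrying score is on the order of 1e-5, and ablating one key alone barely changes the output because the other key absorbs its attention weight. Ablating both keys drops the output by about a third. A gradient-only attribution would therefore rate both keys as unimportant, which is the kind of interaction effect the summary says GIM is designed to account for.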