Language Model Circuits Are Sparse in the Neuron Basis
Researchers have shown that the neurons in a language model's MLP layers are about as sparse as the features learned by Sparse Autoencoders (SAEs). This makes it possible to trace circuits directly in the neuron basis with a gradient-based pipeline that identifies the neurons with a causal effect on a given behavior. The method identified circuits of approximately 100 MLP neurons that control model behavior on subject-verb agreement tasks, and it revealed specific sets of neurons that encode the intermediate reasoning steps in multi-hop city-state-capital tasks. Because it works on the model's existing neurons, it advances automated interpretability without additional training costs.
IMPACT: Advances automated interpretability of language models by showing that MLP neurons are comparably sparse to SAE features, which enables circuit tracing without extra training.
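
The summary does not say which attribution rule the gradient-based pipeline uses. As a rough illustration only, the sketch below applies one common choice, gradient times activation, to a toy one-layer MLP with a linear readout, keeps the top-scoring neurons as the "circuit", and then zero-ablates everything else to check how much of the behavior metric the circuit preserves. The weights, the input, the metric, and every function name here are hypothetical and are not taken from the researchers' code.

```python
# Toy illustration of gradient-based neuron attribution (assumed rule:
# gradient x activation). Not the researchers' actual pipeline.

def hidden_activations(W_in, x):
    """ReLU activation of each MLP neuron: a_j = relu(sum_i W_in[j][i] * x[i])."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W_in]
    return [max(0.0, p) for p in pre]

def metric(acts, w_out):
    """Scalar behavior metric (think: a logit difference) read off linearly."""
    return sum(w * a for w, a in zip(w_out, acts))

def attribution_scores(acts, w_out):
    """Gradient x activation per neuron.

    The readout is linear, so d(metric)/d(a_j) is simply w_out[j].
    In a real model this gradient would come from backpropagation.
    """
    return [w * a for w, a in zip(w_out, acts)]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(order[:k])

def ablate(acts, keep):
    """Zero-ablate every neuron that is not in the candidate circuit."""
    keep = set(keep)
    return [a if j in keep else 0.0 for j, a in enumerate(acts)]

# Hypothetical toy weights: 6 neurons, 2 input features.
W_in = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [-1.0, 0.2],
    [0.1, -0.1],
    [2.0, 1.0],
]
w_out = [3.0, -2.0, 0.1, 0.5, 0.05, 1.0]
x = [1.0, 0.5]

acts = hidden_activations(W_in, x)
scores = attribution_scores(acts, w_out)
circuit = top_k_neurons(scores, k=3)

full_metric = metric(acts, w_out)
circuit_metric = metric(ablate(acts, circuit), w_out)

print("circuit neurons:", circuit)
print("metric with all neurons:", full_metric)
print("metric with circuit only:", circuit_metric)
```

In this toy setting three of the six neurons recover almost all of the metric, which is the sense in which a circuit can be called sparse in the neuron basis. Zero-ablation is only one possible causal check; mean-ablation or activation patching are common alternatives.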