Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
Researchers have introduced "Bag of Dims," a training-free method for mechanistic interpretability of transformer models. The approach treats each dimension of a transformer's hidden state as an independent binary register and reads semantic content from the sign (positive or negative) of its activation. Experiments across several model families, including Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B, indicate that these sign patterns alone are highly predictive: they achieve significant accuracy in next-token prediction and surface numerous semantic features without any additional training.
AI IMPACT: This training-free interpretability method could significantly reduce the computational cost of understanding transformer models.
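
To make the "binary register" idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the helper names `sign_pattern` and `dimension_agreement` are hypothetical, the hidden states are synthetic random data rather than activations from a real model, and the scoring rule (per-dimension agreement between a sign bit and a binary label) is only an assumed stand-in for how a semantic feature might be located without training.

```python
import numpy as np


def sign_pattern(hidden):
    """Map hidden states of shape (n, d) to boolean sign bits (True where positive)."""
    return np.asarray(hidden) > 0


def dimension_agreement(bits, labels):
    """For each dimension, the fraction of examples whose sign bit equals the label."""
    labels = np.asarray(labels, dtype=bool)[:, None]
    return (bits == labels).mean(axis=0)


# Synthetic stand-in for hidden states: 200 examples, 16 dimensions (assumed sizes).
rng = np.random.default_rng(0)
n, d = 200, 16
labels = rng.integers(0, 2, size=n).astype(bool)  # hypothetical binary concept
hidden = rng.normal(size=(n, d))

# Plant a feature: dimension 3 is positive exactly when the concept is present.
hidden[:, 3] = np.where(labels, 1.0, -1.0) * np.abs(hidden[:, 3])

bits = sign_pattern(hidden)
agreement = dimension_agreement(bits, labels)
best_dim = int(np.argmax(agreement))
print("dimension whose sign best tracks the concept:", best_dim)
```

No weights are fitted here: the only operations are a sign threshold and a count, which is the sense in which such an analysis is training-free. With real models, `hidden` would be replaced by activations captured from a chosen layer.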