Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
Researchers have built an audit pipeline that analyzes the internal activations of the GPT-2 Small language model on the Indirect Object Identification (IOI) task. The study identified 146 features in the model's activations that correlate with task failure. One prominent feature, labeled 'cryptographic keys,' is strongly associated with errors when the prompt's object is 'the keys.' Although this feature is a significant correlate, causal ablation experiments indicated that it is not a sufficient cause of failure at this layer, a reminder that a strong correlate of failure is not necessarily its cause.
IMPACT Provides a new, efficient methodology for understanding and debugging language model behavior, potentially leading to more interpretable and reliable AI systems.
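
The distinction the summary draws between a feature that correlates with failure and one that causes it can be illustrated with a small toy sketch. This is not the study's pipeline: the feature vectors, the `toy_fails` rule, and the helper names are all hypothetical, chosen only to show a correlation check followed by a zero-ablation check.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; with 0/1 labels in ys this is the point-biserial r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def zero_ablate(features, index):
    """Return a copy of the feature vector with one feature set to zero."""
    out = list(features)
    out[index] = 0.0
    return out

def toy_fails(features):
    """Hypothetical stand-in for 'the model fails': driven by feature 1 only."""
    return 1 if features[1] > 0.5 else 0

# Hypothetical activations: feature 0 tends to fire alongside feature 1,
# but only feature 1 drives the toy failure rule above.
prompts = [
    [0.9, 0.8], [0.8, 0.9], [0.7, 0.7],
    [0.1, 0.1], [0.0, 0.2], [0.2, 0.0],
]

failed = [toy_fails(p) for p in prompts]

# Step 1: correlational audit of feature 0 against failure.
r = pearson([p[0] for p in prompts], failed)

# Step 2: causal check by zero-ablating feature 0 and re-running the toy rule.
failed_after = [toy_fails(zero_ablate(p, 0)) for p in prompts]
rate_before = sum(failed) / len(failed)
rate_after = sum(failed_after) / len(failed_after)

print(f"correlation of feature 0 with failure: {r:.3f}")
print(f"failure rate before ablation: {rate_before:.2f}, after: {rate_after:.2f}")
```

In this toy setup feature 0 correlates strongly with failure, yet ablating it leaves the failure rate unchanged, because the failure is driven elsewhere. In a real audit the ablation would be applied to the model's activations at a chosen layer and the task metric recomputed, rather than using a hand-written rule.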