Brief

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.AI · 3d · [2 sources]

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

Researchers have introduced Unpack, a method for analyzing the internal workings of transformer models. It uses backward recursion to trace how components such as attention and MLP layers contribute to a model's output. Unpack recovers interaction strengths and per-token attributions from a single forward pass, without interventions or extra training.

IMPACT Provides a novel method for understanding transformer model behavior, potentially aiding in debugging and improving model interpretability.

RESEARCH · arXiv cs.LG · 4d · [2 sources]

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Researchers have developed an audit pipeline to analyze the internals of GPT-2 Small on the Indirect Object Identification (IOI) task. The study identified 146 features in the model's activations that correlate with task failure. One prominent feature, labeled 'cryptographic keys,' is strongly associated with errors when the prompt's object is 'the keys.' Although this feature is a significant correlate, causal ablation experiments indicated it is not a sufficient cause of failure at this layer, highlighting the complexity of understanding model behavior.
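
The gap between a feature that correlates with failure and one that drives it can be illustrated with a toy sketch. This is not the paper's pipeline: the synthetic data, the `phi` helper, and the `model_fails` rule are all hypothetical, chosen so that a feature tracks a hidden confound while having no effect of its own.

```python
import random


def phi(xs, ys):
    """Phi coefficient (Pearson correlation) between two equal-length 0/1 lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx, vy = mx * (1 - mx), my * (1 - my)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def model_fails(confound, feature):
    """Toy stand-in for the model: failure depends only on the hidden confound."""
    return 1 if confound else 0  # `feature` is deliberately ignored


rng = random.Random(0)
# Synthetic prompts: a hidden confound, plus a feature that usually fires with it.
confounds = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]
features = [c if rng.random() < 0.9 else 1 - c for c in confounds]

failures = [model_fails(c, f) for c, f in zip(confounds, features)]
ablated = [model_fails(c, 0) for c in confounds]  # same prompts, feature zeroed out

correlation = phi(features, failures)
rate_before = sum(failures) / len(failures)
rate_after = sum(ablated) / len(ablated)
print(f"correlation={correlation:.2f} before={rate_before:.3f} after={rate_after:.3f}")
```

In this toy setup the feature is a strong correlate of failure, yet zeroing it leaves the failure rate unchanged, which is the kind of pattern an ablation check is meant to expose.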

IMPACT Provides a new, efficient methodology for understanding and debugging language model behavior, potentially leading to more interpretable and reliable AI systems.

RESEARCH · arXiv cs.AI · 4d · [2 sources]

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Researchers have developed a five-stage methodology for causal feature analysis in transformer language models and demonstrated it on GPT-2 Small for the Indirect Object Identification task. The method uses activation patching to identify key circuits and a sparse autoencoder to recover selective features, which it finds to be partially causal. Robustness testing revealed a gap between detection and causal robustness, while a cost-based deployment evaluation showed significant savings for an optimal monitor configuration.
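
Activation patching, the first technique named above, can be sketched on a toy network. This is a minimal illustration under stated assumptions, not the paper's implementation: the two-layer model, the weights `W` and `V`, and the helper names are hypothetical stand-ins for GPT-2 Small and its components. The idea is to run a clean and a corrupted input, splice one clean activation into the corrupted run, and measure how much of the output gap is restored.

```python
def hidden(x, weights):
    """ReLU hidden layer: one activation per row of `weights`."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def readout(h, out_weights):
    """Linear readout of the hidden activations to a single logit."""
    return sum(v * hi for v, hi in zip(out_weights, h))


def patching_effects(clean_x, corrupt_x, weights, out_weights):
    """Fraction of the clean-vs-corrupt logit gap restored by patching each unit."""
    h_clean = hidden(clean_x, weights)
    h_corrupt = hidden(corrupt_x, weights)
    y_corrupt = readout(h_corrupt, out_weights)
    gap = readout(h_clean, out_weights) - y_corrupt
    effects = []
    for i in range(len(h_corrupt)):
        patched = list(h_corrupt)
        patched[i] = h_clean[i]  # splice in the clean activation for unit i
        y_patched = readout(patched, out_weights)
        effects.append((y_patched - y_corrupt) / gap if gap else 0.0)
    return effects


# Hypothetical toy weights: unit 0 reads input 0, unit 1 reads input 1,
# unit 2 reads both inputs but has zero weight in the readout.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [2.0, 1.0, 0.0]
effects = patching_effects([1.0, 1.0], [0.0, 1.0], W, V)
print(effects)
```

Here only unit 0 restores the clean output when patched. Unit 2 also changes between the two runs, but because the readout ignores it, patching it restores nothing, which is why patching is used to separate units that merely differ from units that matter.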

IMPACT Provides a structured approach to understanding and potentially improving the interpretability and reliability of transformer models.
- GPT-2 small
- Caleb Munigety
