Brief · PulseAugur

TOOL · arXiv cs.AI · 6h

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Researchers have developed a method for extracting interpretable algorithms from trained Transformer models. The technique re-parameterizes the Transformer as a RASP program, then uses causal interventions to isolate a small, sufficient sub-program. In experiments on Transformers trained on algorithmic and formal-language tasks, the method often recovered simple RASP programs from models that length-generalize, which the authors present as strong evidence that such Transformers internally implement these programs.
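For readers unfamiliar with RASP, the sketch below shows what a "simple RASP program" can look like. It is a toy Python emulation of three RASP-style primitives (select, aggregate, selector_width), not the paper's decompilation procedure. The helper names and the two example programs (histogram, reverse) are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Selector = List[List[bool]]  # selector[q][k]: does query position q attend to key position k?


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Build a boolean attention pattern from a pairwise predicate (RASP-style select)."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: Selector, values: Sequence[float],
              default: float = 0.0) -> List[float]:
    """Average the selected values at each query position (RASP-style aggregate)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def selector_width(selector: Selector) -> List[int]:
    """Count how many positions each query attends to."""
    return [sum(row) for row in selector]


def histogram(tokens: Sequence[str]) -> List[int]:
    """Toy program: for each position, count occurrences of its token in the input."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


def reverse(tokens: Sequence[str]) -> str:
    """Toy program: reverse the input by attending to the mirrored position."""
    n = len(tokens)
    indices = list(range(n))
    mirrored = [n - 1 - i for i in indices]
    flip = select(indices, mirrored, lambda k, q: k == q)
    codes = [float(ord(c)) for c in tokens]  # aggregate works on numbers
    # Each row selects exactly one position, so the average equals that value.
    return "".join(chr(round(x)) for x in aggregate(flip, codes))


if __name__ == "__main__":
    print(histogram("hello"))
    print(reverse("abc"))
```

Neither toy program depends on a fixed input length, which is the property that makes such programs a natural candidate explanation for length-generalizing models.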

IMPACT Provides a method for understanding the internal computations of Transformer models, potentially leading to more interpretable and trustworthy AI systems.

Transformer
Aleksandra Bakalova