Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers
Two new research papers propose methods for interpreting the internal workings of transformer models, particularly focusing on their attention mechanisms. The first paper introduces a generic interpretation approach for transformers with heterogeneous attention structures, which are crucial for integrating information from multiple sources. The second paper details a three-step recipe called Spectral Probe-Circuits to identify specific attention-head circuits in pretrained transformers, validating its effectiveness across various model sizes and architectures.
AI IMPACT: These new interpretation methods could enhance the transparency and trustworthiness of complex AI models, aiding in debugging, safety analysis, and policy compliance.