SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization
Researchers have introduced SAEExplainer, a framework for improving the interpretability of Sparse Autoencoder (SAE) features in large language models. The method uses activation scores as a reward signal, which lets the explainer self-correct and iteratively refine its explanations. By reducing hallucinated explanations and reinforcing causal patterns, SAEExplainer reportedly outperforms existing methods in the authors' experiments.
AI IMPACT: Enhances understanding of LLM internal workings, potentially leading to more reliable and debuggable AI systems.
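
To make the idea of using activation scores as a reward signal more concrete, here is a minimal sketch of how candidate explanations for one SAE feature could be turned into (chosen, rejected) pairs for preference optimization. It is an illustration, not the authors' implementation: the function `build_preference_pairs`, the `margin` parameter, and the toy scores are all hypothetical, and the summary above does not say how SAEExplainer actually computes scores or forms pairs.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def build_preference_pairs(
    explanations: List[str],
    activation_score: Callable[[str], float],
    margin: float = 0.1,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) explanation pairs ranked by activation score.

    `activation_score` is a hypothetical callable that rates how well an
    explanation matches the feature's real activations (higher is better).
    Pairs whose scores differ by less than `margin` are skipped as too close.
    """
    scored = [(activation_score(e), e) for e in explanations]
    pairs = []
    for (score_a, expl_a), (score_b, expl_b) in combinations(scored, 2):
        if abs(score_a - score_b) < margin:
            continue  # no clear preference between these two explanations
        if score_a > score_b:
            pairs.append((expl_a, expl_b))
        else:
            pairs.append((expl_b, expl_a))
    return pairs


# Made-up scores for three candidate explanations of a single feature.
toy_scores = {
    "fires on legal contract language": 0.82,
    "fires on any formal text": 0.41,
    "fires on punctuation": 0.05,
}

pairs = build_preference_pairs(list(toy_scores), toy_scores.__getitem__)
for chosen, rejected in pairs:
    print(f"prefer: {chosen!r} over {rejected!r}")
```

In a setup like this, the resulting pairs would feed a preference-optimization step, and the refined explainer would generate new candidates for the next round. The margin filter is one simple way to avoid training on pairs where the activation evidence does not clearly favor either explanation.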