PulseAugur

New method improves neural network interpretability by addressing dense activations

Researchers have proposed a method for improving the interpretability of neural networks that questions the assumption that all activation content can be sparsely decomposed. They hypothesize that activations contain a computationally important, low-rank, dense component that is poorly suited to sparse representation. To address this, they add a small linear bottleneck in parallel with a standard sparse autoencoder (SAE), so that dense structure is absorbed before sparse reconstruction. On Gemma-2-2B layer 12, the approach substantially reduced the dense latent count while improving results on sparse probing and targeted probe perturbation.
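The parallel arrangement described above can be pictured with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the TopK activation, the parameter names (`W_down`, `W_up`, `W_enc`, `W_dec`), the initialization, and the choice to feed the same input to both branches and sum their outputs are all hypothetical, since the summary does not specify them.

```python
import numpy as np


def init_params(d_model, n_latents, rank, seed=0):
    """Random parameters for a sparse branch plus a low-rank dense branch.

    All names and the initialization scale are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        # Sparse autoencoder branch (wide dictionary).
        "W_enc": rng.normal(0.0, scale, (d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, scale, (n_latents, d_model)),
        "b_dec": np.zeros(d_model),
        # Dense branch: a small linear bottleneck of width `rank`.
        "W_down": rng.normal(0.0, scale, (d_model, rank)),
        "W_up": rng.normal(0.0, scale, (rank, d_model)),
    }


def topk_relu(pre, k):
    """Keep the k largest pre-activations per row, zero the rest, then ReLU."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    kept = np.zeros_like(pre)
    np.put_along_axis(kept, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(kept, 0.0)


def forward(params, x, k):
    """Reconstruct x as (low-rank dense part) + (sparse dictionary part)."""
    # Dense branch: purely linear, so its output has rank <= bottleneck width.
    dense_code = x @ params["W_down"]
    dense_recon = dense_code @ params["W_up"]

    # Sparse branch: assumed here to read the same input as the dense branch.
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    sparse_code = topk_relu(pre, k)
    sparse_recon = sparse_code @ params["W_dec"] + params["b_dec"]

    return dense_recon + sparse_recon, sparse_code, dense_code


if __name__ == "__main__":
    params = init_params(d_model=16, n_latents=64, rank=2)
    x = np.random.default_rng(1).normal(size=(8, 16))
    recon, sparse_code, dense_code = forward(params, x, k=4)
    print(recon.shape, sparse_code.shape, dense_code.shape)
```

In this sketch the dense branch can only express a rank-limited linear map of the input, which is the sense in which it could absorb a low-rank dense component; training both branches jointly on a reconstruction loss is left out.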



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG · English (EN) · Ruixuan Deng, Zehao Jin, Zekun Wang, Zihan Dong

    Decompose Sparsely Where You Should, Absorb Densely Where You Should No

    arXiv:2606.14040v1. Abstract: Sparse autoencoders (SAEs) are typically trained to reconstruct the entire residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We q…