PulseAugur

Ensembling Sparse Autoencoders Enhances Neural Network Interpretability

Researchers have introduced and formalized the concept of ensembling sparse autoencoders (SAEs) to improve the interpretability and utility of features extracted from neural networks. The approach addresses a limitation of single SAEs, which capture only a subset of the features that can be extracted. Combining multiple SAEs through naive bagging and boosting is shown to reduce reconstruction error and make feature extraction more stable. In empirical evaluations on language models, SAE ensembles reconstruct activations better than expanded SAEs and perform better on downstream tasks such as concept detection and spurious correlation removal.
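The summary does not spell out how the two ensembles combine their members. Below is a minimal sketch, assuming that naive bagging averages the reconstructions of SAEs trained on the same activations from different random initializations, and that boosting trains each new SAE on the residual left by the earlier members and sums their reconstructions. `TinySAE`, `bagging_reconstruct`, `boosting_reconstruct`, the synthetic data, and every hyperparameter are hypothetical illustrations, not the authors' code; the random matrix only stands in for language-model activations.

```python
import numpy as np


class TinySAE:
    """Hypothetical minimal ReLU sparse autoencoder, for illustration only.

    z = relu(x @ W_enc.T + b_enc) and x_hat = z @ W_dec.T + b_dec, trained with
    full-batch gradient descent on mean(||x_hat - x||^2 + l1 * ||z||_1).
    """

    def __init__(self, d_in, d_hidden, l1=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, size=(d_hidden, d_in))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = self.W_enc.T.copy()  # tied at init, trained independently
        self.b_dec = np.zeros(d_in)
        self.l1 = l1

    def encode(self, X):
        return np.maximum(0.0, X @ self.W_enc.T + self.b_enc)

    def reconstruct(self, X):
        return self.encode(X) @ self.W_dec.T + self.b_dec

    def fit(self, X, steps=2000, lr=0.01):
        n = X.shape[0]
        for _ in range(steps):
            pre = X @ self.W_enc.T + self.b_enc
            Z = np.maximum(0.0, pre)
            err = Z @ self.W_dec.T + self.b_dec - X
            # All gradients are computed from the current parameters before updating.
            g_out = 2.0 * err / n
            g_Z = g_out @ self.W_dec + self.l1 * np.sign(Z) / n
            g_pre = g_Z * (pre > 0)
            self.W_dec -= lr * (g_out.T @ Z)
            self.b_dec -= lr * g_out.sum(axis=0)
            self.W_enc -= lr * (g_pre.T @ X)
            self.b_enc -= lr * g_pre.sum(axis=0)
        return self


def mse(X, X_hat):
    """Mean over samples of the squared L2 reconstruction error."""
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))


def bagging_reconstruct(X, d_hidden, n_members=3):
    # Assumed naive bagging: same data, different random inits, averaged outputs.
    members = [TinySAE(X.shape[1], d_hidden, seed=s).fit(X) for s in range(n_members)]
    return members, np.mean([m.reconstruct(X) for m in members], axis=0)


def boosting_reconstruct(X, d_hidden, n_members=3):
    # Assumed boosting: each SAE is fit to the residual of the members before it,
    # and the ensemble reconstruction is the sum of the member reconstructions.
    members, recon = [], np.zeros_like(X)
    for s in range(n_members):
        residual = X - recon
        sae = TinySAE(X.shape[1], d_hidden, seed=s).fit(residual)
        recon = recon + sae.reconstruct(residual)
        members.append(sae)
    return members, recon


# Toy stand-in for activations: sparse combinations of 24 random directions in R^16.
rng = np.random.default_rng(42)
dictionary = rng.normal(size=(24, 16))
codes = rng.exponential(size=(256, 24)) * (rng.random((256, 24)) < 0.1)
X = codes @ dictionary
X = (X - X.mean(axis=0)) / X.std()

single_err = mse(X, TinySAE(16, 32, seed=0).fit(X).reconstruct(X))
bag_members, bag_recon = bagging_reconstruct(X, d_hidden=32)
member_errs = [mse(X, m.reconstruct(X)) for m in bag_members]
bag_err = mse(X, bag_recon)
_, boost_recon = boosting_reconstruct(X, d_hidden=32)
boost_err = mse(X, boost_recon)
print(f"single={single_err:.3f} bagging={bag_err:.3f} boosting={boost_err:.3f}")
```

Averaging can never do worse than the members' mean error, because the squared norm is convex, while each boosting round targets only what the earlier members missed. The sketch says nothing about how either ensemble compares with a single expanded SAE, which is the comparison the paper reports.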



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG · Soham Gadgil, Chris Lin, Su-In Lee

    Ensembling Sparse Autoencoders

    arXiv:2505.16077v2 (Announce Type: replace). Abstract: Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown…