PulseAugur

Sparse autoencoders show unstable features form reproducible subspaces

Researchers have investigated the reproducibility of features learned by sparse autoencoders (SAEs), a common tool for interpreting neural network representations. Their study reveals that while individual features can be unstable across different training runs, they often aggregate into reproducible lower-rank subspaces. Stable features are found to carry the majority of the signal relevant for reconstruction and prediction, whereas unstable features have minimal impact and are linked to surface-level triggers.

IMPACT Clarifies how to interpret learned features in neural networks, potentially improving model interpretability and debugging.

RANK_REASON This is a research paper detailing findings on the behavior of sparse autoencoders.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov ·

    Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

    arXiv:2606.12138v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature …

  2. arXiv cs.AI TIER_1 English(EN) · Daniil Gavrilov ·

    Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

    Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the …