Sparse autoencoders show stable features carry most signal

By PulseAugur Editorial · [1 source] · 2026-06-10 14:32

Researchers have investigated the reproducibility of features learned by sparse autoencoders (SAEs), which are commonly used to interpret neural network representations. Their study reveals a significant asymmetry: stable features are crucial for reconstruction and prediction, while unstable features have minimal impact and are often triggered by superficial patterns. Geometrically, unstable features are individually non-reproducible across training runs, yet they tend to cluster within reproducible lower-dimensional subspaces. This indicates that seed dependence often stems from ambiguity in feature representation rather than from pure randomness. By aggregating unique cross-seed features, the researchers were able to construct more stable SAEs.
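To make the idea of cross-seed stability concrete, here is a minimal sketch in plain Python. It is not the paper's estimator (the abstract is truncated before the definition); it assumes a common proxy, scoring each feature by its best cosine match among the feature directions of a second training run. The toy vectors, the `THRESHOLD` value, and the function names `stability_scores` and `subspace_capture` are all hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def stability_scores(features_a, features_b):
    """For each feature direction in run A, its best cosine match in run B.

    Assumed proxy for stability, not the paper's definition.
    """
    return [max(cosine(fa, fb) for fb in features_b) for fa in features_a]

def subspace_capture(feature, orthonormal_basis):
    """Norm of the projection of `feature` onto the span of an orthonormal basis.

    For a unit-length feature, a value near 1 means it lies inside that span.
    """
    return math.sqrt(sum(dot(feature, b) ** 2 for b in orthonormal_basis))

# Toy feature directions from two hypothetical seeds in a 4-D space.
# Both runs find the first two axes (in a different order), but they
# split the plane spanned by the last two axes differently.
s = 1 / math.sqrt(2)
seed_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
seed_b = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, s, s], [0, 0, s, -s]]

THRESHOLD = 0.9  # hypothetical cutoff between "stable" and "unstable"

scores_a = stability_scores(seed_a, seed_b)
scores_b = stability_scores(seed_b, seed_a)

unstable_a = [f for f, sc in zip(seed_a, scores_a) if sc < THRESHOLD]
# In this toy case the unstable run-B features happen to be orthonormal,
# so they can be used directly as a basis for the projection.
unstable_b = [f for f, sc in zip(seed_b, scores_b) if sc < THRESHOLD]

captures = [subspace_capture(f, unstable_b) for f in unstable_a]

print("stability scores (run A):", [round(x, 3) for x in scores_a])
print("subspace capture of unstable features:", [round(x, 3) for x in captures])
```

In this toy setup, the last two features of each run have no close individual match in the other run, yet each lies entirely within the plane spanned by the other run's unstable features. That mirrors the summary's geometric point: a feature can fail to reproduce while the subspace it occupies does reproduce.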

IMPACT Identifies a key challenge in interpreting neural network representations and suggests methods for improving feature stability and interpretability.

RANK_REASON This is a research paper published on arXiv detailing findings about sparse autoencoders.



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Daniil Gavrilov · 2026-06-10 14:32

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the …

