PulseAugur

Sparse autoencoders show stable features carry most signal

Researchers have investigated whether the features learned by sparse autoencoders (SAEs), which are commonly used to interpret neural network representations, are reproducible across training runs. The study reports a marked asymmetry: stable features are crucial for reconstruction and prediction, while unstable features have minimal impact and are often triggered by superficial patterns. Geometrically, unstable features are individually non-reproducible across training runs, yet they tend to cluster within reproducible lower-dimensional subspaces. This indicates that seed dependence often stems from ambiguity in how features are represented rather than from pure randomness. By aggregating unique cross-seed features, the researchers were able to construct more stable SAEs.
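To make the idea of cross-seed reproducibility concrete, here is a minimal sketch of one common way to score it: for each feature direction in one SAE, take its best cosine match among the feature directions of an SAE trained with a different seed. This is an illustrative assumption, not the estimator used in the paper, whose definition is not given in this summary. The function name `cross_seed_stability`, the toy matrices, and the 0.9 threshold are all hypothetical.

```python
import numpy as np

def cross_seed_stability(dec_a: np.ndarray, dec_b: np.ndarray) -> np.ndarray:
    """For each row (feature direction) of dec_a, return its best cosine match in dec_b.

    Hypothetical helper: rows are assumed to be decoder directions of two SAEs
    trained with different seeds on the same activations.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

# Toy example: seed B recovers the first two directions of seed A
# (permuted and rescaled) but only partially overlaps with the third.
dec_a = np.eye(4)[:3]
dec_b = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

scores = cross_seed_stability(dec_a, dec_b)  # approx. [1.0, 1.0, 0.707]
stable = scores > 0.9                        # assumed threshold, for illustration only
print(scores, stable)
```

In this toy case the third feature has no close individual match, even though the direction it partially overlaps with lies in a subspace shared by both runs, which is the kind of situation the summary describes.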

IMPACT Identifies a key challenge in interpreting neural network representations and suggests methods for improving feature stability and interpretability.

RANK_REASON This is a research paper published on arXiv detailing findings about sparse autoencoders.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Daniil Gavrilov

    Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

    Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the …