PulseAugur

Research questions stability of Archetypal SAEs for concept extraction

A new research paper challenges the stability claims made for Archetypal Sparse Autoencoders (SAEs), a method designed to make concept extraction from neural networks more reliable. The study demonstrates that the reported stability is an artifact of using identical initialization across runs, rather than an inherent property of the archetypal constraint. When that deterministic initialization is removed, the archetypal method shows no significant stabilization advantage. The paper also highlights issues with metric design that complicate the interpretation of endpoint stability.
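To make the initialization point concrete, here is a toy sketch in pure Python. It is not the paper's experiment or metric: no SAE is trained, the `perturb` helper is a hypothetical stand-in for training drift, and `mean_max_cosine` is an assumed, simplified stability score (the average best cosine match between two dictionaries). It only illustrates how two runs that share a starting dictionary can score as highly "stable" across seeds, while runs with seed-dependent starts do not.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_max_cosine(dict_a, dict_b):
    """Assumed stability proxy: for each atom in dict_a, take its best
    cosine match in dict_b, then average over the atoms of dict_a."""
    return sum(max(cosine(a, b) for b in dict_b) for a in dict_a) / len(dict_a)


def random_dictionary(rng, n_atoms, dim):
    """Draw a dictionary of Gaussian atoms from the given random generator."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_atoms)]


def perturb(dictionary, rng, scale):
    """Hypothetical stand-in for training drift: add Gaussian noise to every atom."""
    return [[x + rng.gauss(0.0, scale) for x in atom] for atom in dictionary]


def compare_init_schemes(n_atoms=32, dim=64, drift=0.1):
    """Return (score with shared init, score with seed-dependent init)."""
    shared_init = random_dictionary(random.Random(0), n_atoms, dim)
    # Two "runs" with different seeds but the SAME starting dictionary.
    run_a = perturb(shared_init, random.Random(1), drift)
    run_b = perturb(shared_init, random.Random(2), drift)
    # Two "runs" whose starting dictionaries also depend on the seed.
    run_c = perturb(random_dictionary(random.Random(3), n_atoms, dim), random.Random(1), drift)
    run_d = perturb(random_dictionary(random.Random(4), n_atoms, dim), random.Random(2), drift)
    return mean_max_cosine(run_a, run_b), mean_max_cosine(run_c, run_d)


if __name__ == "__main__":
    shared, independent = compare_init_schemes()
    print(f"shared init: {shared:.3f}, seed-dependent init: {independent:.3f}")
```

In this sketch the shared-initialization pair scores close to 1 and the seed-dependent pair scores much lower, even though neither pair uses any stabilizing constraint. That is the kind of confound the summary describes, shown under the simplifying assumptions stated above.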

IMPACT Challenges the reliability of a specific interpretability technique, potentially impacting how researchers analyze neural network features.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Michał Brzozowski, Neo Christopher Chung

    Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

    arXiv:2606.02061v1. Abstract: Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduces polysemanticity. However, features from SAEs vary substantially across random s…

  2. arXiv cs.LG TIER_1 English(EN) · Neo Christopher Chung

    Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

    Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduces polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetyp…