Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design
A new research paper challenges the stability claims of Archetypal Sparse Autoencoders (SAEs), a method designed for more reliable concept extraction in neural networks. The study demonstrates that the reported stability is an artifact of identical initialization across runs, rather than an inherent property of the archetypal constraint. When this deterministic initialization is removed, the archetypal method shows no significant stabilization advantage. The paper also highlights issues with metric design that complicate the interpretation of endpoint stability.
IMPACT: Challenges the reliability of a specific interpretability technique, which may affect how researchers analyze neural network features.