EleutherAI finds SAEs trained on same data learn dissimilar features

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers at EleutherAI report that Sparse Autoencoders (SAEs) trained on identical data with identical configurations do not consistently learn the same features. When two SAEs were trained from different random initializations, only about 53% of their learned features were shared. The overlap decreased as SAE size increased, suggesting that larger SAEs may develop more arbitrary or diverse internal representations. The study used the Hungarian algorithm to match latents between the two SAEs and measure their similarity, and the results suggest that feature splitting and absorption might contribute to these disjoint representations.
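The matching step described above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes each SAE exposes a decoder matrix with one row per latent, compares latents by cosine similarity, and counts a matched pair as "shared" above a cutoff of 0.7. The function names, the use of decoder directions, and the cutoff are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_latents(dec_a, dec_b):
    """One-to-one matching of latents between two SAEs (hypothetical helper).

    dec_a, dec_b: arrays of shape (n_latents, d_model), one row per latent.
    Returns the row indices, the matched column indices, and the cosine
    similarity of each matched pair.
    """
    # Normalize rows so the dot product is a cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = a @ b.T
    # Hungarian algorithm: the assignment that maximizes total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return rows, cols, sim[rows, cols]


def shared_fraction(matched_sim, threshold=0.7):
    """Fraction of matched pairs above the cutoff (0.7 is an assumed value)."""
    return float(np.mean(matched_sim >= threshold))


# Toy data: SAE B is a shuffled copy of SAE A, so every latent has a twin.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(8, 16))
perm = rng.permutation(8)
rows, cols, matched_sim = match_latents(dec_a, dec_a[perm])
print(shared_fraction(matched_sim))
```

Because the assignment is one-to-one, a latent that has split into several narrower latents in the other SAE can be paired with at most one of them, which is one way splitting could lower a measured overlap.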






COVERAGE [1]

EleutherAI Blog TIER_1 · 2024-12-12 16:00

SAEs trained on the same data don’t learn the same features

In this post, we show that when two TopK SAEs are trained on the same data, with the same batch order but with different random initializations, there are many latents in the first SAE that don't have a close counterpart in the second, and vice versa. Indeed, when training only a…

