AI models learn same features but in rotated bases, researchers find

By PulseAugur Editorial · [1 sources] · 2026-05-31 13:11

Researchers have found that independently trained transformer models of the same architecture learn similar features, but their internal activation spaces differ by an unknown random rotation. This "polymorphism" means that features identified in one model are unintelligible in another without correction. Applying a Sparse Autoencoder (SAE) trained on one model to another model's activations causes catastrophic reconstruction failure, which can be fixed with a single matrix multiplication that aligns the two bases.
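To make the "single matrix multiplication" concrete, here is a minimal sketch on synthetic data. It is not the method from the original post: it assumes the two models' activations differ by an exact orthogonal rotation, and it uses orthogonal Procrustes (via SVD) as one standard way to estimate that rotation from paired activations. All names (`acts_a`, `acts_b`, `procrustes_align`) are hypothetical.

```python
import numpy as np


def random_rotation(d, rng):
    """Draw a random d x d orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the result does not depend on QR sign conventions.
    return q * np.sign(np.diag(r))


def procrustes_align(source, target):
    """Return the orthogonal R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def relative_error(x, y):
    """Frobenius-norm error of x against y, relative to the norm of y."""
    return np.linalg.norm(x - y) / np.linalg.norm(y)


rng = np.random.default_rng(0)
n, d = 2000, 16

# Assumption: "model B" activations are "model A" activations in a rotated basis.
acts_a = rng.standard_normal((n, d))
true_q = random_rotation(d, rng)
acts_b = acts_a @ true_q

# Estimate the rotation from paired activations, then apply it once.
rotation = procrustes_align(acts_b, acts_a)

err_before = relative_error(acts_b, acts_a)             # B read in A's basis, uncorrected
err_after = relative_error(acts_b @ rotation, acts_a)   # B after one matrix multiply

print(f"relative error before alignment: {err_before:.3f}")
print(f"relative error after alignment:  {err_after:.2e}")
```

In this toy setting, anything trained on `acts_a` (an SAE, a probe) would receive `acts_b @ rotation` instead of raw `acts_b`. Real activations are noisy and only approximately related by a rotation, so the post-alignment error would not be expected to reach machine precision.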

IMPACT Understanding internal model representations could lead to better interpretability and steerability of AI systems.

RANK_REASON Academic paper detailing a novel finding about internal model representations.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

LessWrong (AI tag) TIER_1 English(EN) · Jordan McCann · 2026-05-31 13:11

Features of SAEs are universal - but only up to an unknown random rotation

Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction sc…

