PulseAugur

AI models learn same features but in rotated bases, researchers find

Researchers have found that independently trained transformer models of the same architecture learn similar features, but their internal activation spaces differ by an unknown random rotation. Because of this "polymorphism," features identified in one model are unintelligible in another without correction. Applying a Sparse Autoencoder (SAE) trained on one model to another model's activations causes catastrophic reconstruction failure, but a single matrix multiplication that aligns the two bases fixes it.
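
As a rough illustration of the alignment step, here is a minimal sketch on synthetic data. It rests on assumptions not confirmed by the source: the correcting matrix is estimated with orthogonal Procrustes (an SVD of the cross-covariance of paired activations), the two "models" differ by an exact rotation, and a simple subspace projection stands in for a real SAE. All names (`acts_a`, `acts_b`, `reconstruct`, `rotation`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 32, 8  # samples, activation width, number of toy "features"

# Hypothetical model A: activations live in a k-dimensional subspace of R^d.
basis = np.linalg.qr(rng.standard_normal((d, k)))[0]  # d x k, orthonormal columns
acts_a = rng.standard_normal((n, k)) @ basis.T        # n x d

# Hypothetical model B: the same activations seen through a random rotation Q.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
acts_b = acts_a @ q

def reconstruct(x):
    """Stand-in for an autoencoder fitted to model A (NOT a real SAE):
    encode by projecting onto A's feature directions, then decode."""
    return (x @ basis) @ basis.T

def rel_error(x):
    """Relative reconstruction error of the stand-in autoencoder on x."""
    return np.linalg.norm(x - reconstruct(x)) / np.linalg.norm(x)

# Orthogonal Procrustes: the orthogonal R minimising ||acts_b @ R - acts_a||_F
# is U @ Vt, where U, S, Vt is the SVD of acts_b.T @ acts_a.
u, _, vt = np.linalg.svd(acts_b.T @ acts_a)
rotation = u @ vt

err_a = rel_error(acts_a)                    # autoencoder on its own model
err_b_raw = rel_error(acts_b)                # applied to the other model directly
err_b_aligned = rel_error(acts_b @ rotation) # after one matrix multiplication

print(f"A: {err_a:.2e}  B raw: {err_b_raw:.2f}  B aligned: {err_b_aligned:.2e}")
```

In this toy setup the raw cross-model error is large while the aligned error returns to near zero. With real models, the paired activations would have to come from running both models on the same inputs, and the relationship would be only approximately a rotation, so the recovery would be partial rather than exact.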

IMPACT Understanding internal model representations could lead to better interpretability and steerability of AI systems.

RANK_REASON Academic paper detailing a novel finding about internal model representations.

Read on LessWrong (AI tag) →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. LessWrong (AI tag) TIER_1 English(EN) · Jordan McCann ·

    Features of SAEs are universal - but only up to an unknown random rotation

    Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction sc…