How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings
Researchers investigated how well auto-interpretation labels for sparse autoencoder (SAE) features in language models generalize. Using Serbian digraphia (Serbian is written in both Cyrillic and Latin scripts) as a testbed, they found that the SAE features activated by the same content in different languages and scripts overlapped significantly, indicating genuinely cross-lingual semantic features. The auto-interpretation labels, however, often failed to keep pace: they missed the same meaning in Serbian up to four times more often than in English, and failed more often on Serbian Cyrillic than on Serbian Latin.
AI IMPACT Auto-interpretation labels may not accurately reflect a feature's behavior across languages and scripts, which could mislead AI researchers who rely on them.