Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 8h

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

Researchers investigated how well auto-interpretation labels for sparse autoencoder (SAE) features in language models generalize. Using Serbian digraphia (Serbian is written in both Cyrillic and Latin scripts) as a testbed, they found that the SAE features activated by similar content overlapped substantially across languages and scripts, indicating genuine cross-lingual semantic features. The auto-interpretation labels generalized less well: they missed the same meaning in Serbian up to four times more often than in English, and failed more often on Serbian Cyrillic than on Serbian Latin.
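The brief does not say which metrics the paper uses, so the following is only an illustrative sketch of the two quantities it describes: how much the sets of active SAE features overlap across languages and scripts, and how often a label misses inputs on which its feature fires. Jaccard overlap, the `miss_rate` definition, and all feature IDs and flags are assumptions made for illustration, not the authors' method or data.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of active feature IDs (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def miss_rate(label_predicts, feature_fires):
    """Fraction of inputs where the feature fires but the label did not predict it.

    Both arguments are equal-length sequences of booleans, one entry per input.
    This definition is a hypothetical stand-in for the paper's failure measure.
    """
    predictions_on_firing = [p for p, f in zip(label_predicts, feature_fires) if f]
    if not predictions_on_firing:
        return 0.0
    misses = sum(1 for p in predictions_on_firing if not p)
    return misses / len(predictions_on_firing)


# Hypothetical feature IDs active for the same sentence in three renderings.
english_active = {101, 205, 317, 442}
serbian_latin_active = {101, 205, 317, 590}
serbian_cyrillic_active = {101, 205, 611, 590}

overlap_en_latin = jaccard(english_active, serbian_latin_active)
overlap_en_cyrillic = jaccard(english_active, serbian_cyrillic_active)

# Hypothetical per-input flags: does the feature fire, and does the label predict it?
fires = [True, True, True, False]
label_says = [True, False, True, False]
example_miss_rate = miss_rate(label_says, fires)

print(overlap_en_latin, overlap_en_cyrillic, example_miss_rate)
```

Comparing `miss_rate` per language or script while holding the content fixed is one way to separate a feature that genuinely transfers from a label that does not.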

IMPACT Auto-interpretation labels may not accurately reflect a feature's behavior across different languages and scripts, potentially misleading AI researchers.

Language Models
Serbian
Sparse Autoencoder (SAE)