How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

Study finds that auto-interpretation labels for AI models generalize poorly across languages

By the PulseAugur editorial team · [1 source] · 2026-06-02 04:00

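Researchers investigated how well auto-interpretation labels for sparse autoencoder (SAE) features in language models generalize. Using Serbian's two writing systems as a testbed, they found that the SAE features activated by similar content in different languages and scripts overlap substantially, which suggests that genuine cross-lingual semantic features exist. The auto-interpretation labels, however, often fail to keep pace: they miss the same meaning four times as often in Serbian as in English, and they fail more often on Serbian Cyrillic than on Serbian Latin.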
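One simple way to make "feature overlap" concrete is to compare the sets of SAE feature indices that fire on parallel inputs. The snippet below is a minimal sketch under stated assumptions, not the paper's method: the choice of Jaccard overlap, the function name `jaccard_overlap`, and every feature index are hypothetical and invented purely for illustration.

```python
def jaccard_overlap(features_a, features_b):
    """Jaccard overlap between two collections of active SAE feature indices."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        # Two empty sets share nothing measurable; report zero overlap.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature indices (invented, not taken from the paper) for the
# same sentence rendered in English, Serbian Latin, and Serbian Cyrillic.
active_english = {12, 87, 301, 4410}
active_serbian_latin = {12, 87, 301, 9002}
active_serbian_cyrillic = {12, 87, 5150, 9002}

overlap_latin = jaccard_overlap(active_english, active_serbian_latin)
overlap_cyrillic = jaccard_overlap(active_english, active_serbian_cyrillic)

print(f"English vs Serbian Latin:    {overlap_latin:.2f}")
print(f"English vs Serbian Cyrillic: {overlap_cyrillic:.2f}")
```

A set-based measure like this says only whether the same features fire; it does not say whether a feature's label still describes the text that triggered it, which is the separate question the study raises.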

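Impact: Auto-interpretation labels may not accurately reflect how features behave across languages and scripts, which could mislead AI researchers.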

Ranking rationale: This is a research paper analyzing how well interpretation labels for AI models generalize.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Sripad Karne · 2026-06-02 04:00

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv:2606.00356v1 Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these …

