The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

Study Reveals the Tokenizer Tax Across 25 European Languages

By the PulseAugur editorial team · [1 source] · 2026-05-26 04:00

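A new research paper analyzes the "tokenizer tax": the hidden cost imposed on non-English natural language processing by the way words are split into tokens. The study measures token fertility (the number of tokens per word) for ten foundation models across 25 European languages and finds significant disparities. Greek and Maltese have the highest fertility, requiring more than three tokens per word, whereas English needs just over one.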

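Impact: The findings highlight how inefficient current NLP models are for non-English languages and could drive the development of more equitable tokenization strategies.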

Ranking rationale: Academic paper presenting a new analysis of NLP tokenization costs.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Volodymyr Ovcharov · 2026-05-26 04:00

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

arXiv:2605.24718v1. Abstract: Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax m…
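
To make the fertility metric concrete, here is a minimal Python sketch of how tokens per word might be computed. It is an illustration, not the paper's method. It assumes words are whitespace-delimited, uses a hypothetical stand-in tokenizer (`toy_tokenize`) rather than any real model's tokenizer, and defines the relative tax as fertility divided by English fertility. The sample sentences are invented and are not parallel data from the study.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (an assumed definition of 'word')."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: cut each word into chunks of at most max_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


# Invented example sentences, used only to exercise the functions.
english = ["the cat sat on the mat"]
german = ["die Katze saß auf der Fußmatte"]

f_en = fertility(english, toy_tokenize)
f_de = fertility(german, toy_tokenize)

# Assumed definition of the relative "tax": fertility as a multiple of English fertility.
relative_tax = f_de / f_en

print(f"English fertility: {f_en:.2f}")
print(f"German fertility:  {f_de:.2f}")
print(f"Relative tax:      {relative_tax:.2f}x")
```

To measure a real model, `toy_tokenize` could be replaced by that model's own tokenizer function, for example the `tokenize` method of a Hugging Face tokenizer, with the same parallel sentences used for every language so that the comparison is controlled.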