Compute Optimal Tokenization

New research suggests that, for compute-optimal training, model size should scale in proportion to data bytes rather than token count

By the PulseAugur editorial team · [1 source] · 2026-05-05 04:00

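A new paper examines how token granularity affects language model scaling laws. The researchers trained 988 models with varying parameter counts and compression rates to study how tokenization affects compute efficiency. They find that model parameters should scale in proportion to the size of the data in bytes rather than the number of tokens, and that the optimal compression rate decreases as compute grows. Both results offer guidance to developers choosing a tokenizer and a model size.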
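The difference between byte-proportional and token-proportional sizing can be shown with a little arithmetic. The sketch below is only an illustration, not the paper's method: it assumes that "compression rate" means the average number of bytes per token, and the data size, the proportionality constant `PARAMS_PER_BYTE`, and the tokenizer names are all hypothetical values chosen for readability. The paper's fitted coefficients are not given in this summary.

```python
def tokens_from_bytes(n_bytes: float, bytes_per_token: float) -> float:
    """Number of tokens needed to cover n_bytes at a given compression rate.

    Assumption: compression rate = average bytes per token.
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return n_bytes / bytes_per_token


def params_for_bytes(n_bytes: float, params_per_byte: float) -> float:
    """Byte-proportional sizing rule: parameters scale with raw data bytes."""
    return params_per_byte * n_bytes


# Hypothetical values, chosen only for illustration.
DATA_BYTES = 400e9          # 400 GB of raw training text
PARAMS_PER_BYTE = 0.0125    # made-up proportionality constant
COMPRESSION_RATES = {       # made-up tokenizers, in bytes per token
    "byte-level": 1.0,
    "subword-a": 4.0,
    "subword-b": 8.0,
}

if __name__ == "__main__":
    for name, rate in COMPRESSION_RATES.items():
        tokens = tokens_from_bytes(DATA_BYTES, rate)
        # Under the byte-proportional rule the model size does not
        # depend on which tokenizer produced the tokens.
        params = params_for_bytes(DATA_BYTES, PARAMS_PER_BYTE)
        print(f"{name}: tokens={tokens:.3e} params={params:.3e} "
              f"tokens_per_param={tokens / params:.1f}")
```

Under these assumptions the same corpus maps to one model size regardless of tokenizer, while the tokens-per-parameter ratio shifts with the compression rate. A rule expressed as a fixed tokens-per-parameter ratio would therefore recommend different model sizes for the same data depending on the tokenizer.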

Impact: Offers new insight into how tokenization can be chosen to optimize the compute efficiency of language models.

Ranking rationale: An academic paper detailing new findings on how tokenization affects LLM scaling laws.

AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.CL TIER_1 English(EN) · Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer · 2026-05-05 04:00

Compute Optimal Tokenization

arXiv:2605.01188v1. Abstract: Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the informatio…