Tokenization with Split Trees

New ToaST tokenizer cuts token counts by more than 11%

By PulseAugur Editorial Team · [2 sources] · 2026-05-21 16:46

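Researchers have developed a new subword tokenization method called Tokenization with Split Trees (ToaST). The method optimizes compression by recursively splitting text into a binary tree and selecting the vocabulary through a relaxation of an integer program. Compared with existing methods such as BPE and WordPiece, ToaST reduces token counts by more than 11% and achieves better performance when training 1.5B-parameter language models.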

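Impact: By reducing token counts and extending the effective context length, this new tokenization method could enable more efficient language models.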

Ranking rationale: This cluster contains an academic paper that details a new subword tokenization method.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner · 2026-05-22 04:00

Tokenization with Split Trees

arXiv:2605.22705v1 · Abstract: We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using preco…
arXiv cs.CL TIER_1 English(EN) · Chris Tanner · 2026-05-21 16:46

Tokenization with Split Trees

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vo…
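
The greedy splitting step mentioned in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the scoring rule (pick the split point whose two halves have the highest summed n-gram count), the inference rule (emit a node if it is in the vocabulary, otherwise descend to its children), and all names (`Node`, `build_split_tree`, `tokenize`, the toy counts) are hypothetical. The vocabulary selection via an integer programming relaxation is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One node of a full binary split tree over a pretoken's bytes."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(pretoken: bytes, counts: Dict[bytes, int]) -> Node:
    """Greedily split a pretoken until every leaf is a single byte.

    ASSUMPTION: the split score is the sum of the two halves' n-gram
    counts; the paper's actual criterion may differ.
    """
    node = Node(pretoken)
    if len(pretoken) > 1:
        best = max(
            range(1, len(pretoken)),
            key=lambda i: counts.get(pretoken[:i], 0) + counts.get(pretoken[i:], 0),
        )
        node.left = build_split_tree(pretoken[:best], counts)
        node.right = build_split_tree(pretoken[best:], counts)
    return node


def tokenize(node: Node, vocab: Set[bytes]) -> List[bytes]:
    """Recursive inference (ASSUMED rule): emit a node if it is in the
    vocabulary or is a single-byte leaf, otherwise recurse into children."""
    if node.text in vocab or node.left is None:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Toy byte n-gram counts, invented for illustration only.
toy_counts = {b"token": 5, b"izer": 4, b"tok": 3, b"en": 3}
tree = build_split_tree(b"tokenizer", toy_counts)
print(tokenize(tree, {b"token", b"izer"}))
```

In this sketch the tree is built once from the counts and does not depend on the vocabulary, so the same tree can be tokenized against different candidate vocabularies; a larger vocabulary simply stops the descent higher up the tree and yields fewer tokens.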