Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 4d · [2 sources]

Tokenisation via Convex Relaxations

Researchers have introduced ConvexTok, a tokenization algorithm that constructs tokenizers using convex optimization. Existing methods such as BPE and Unigram build a vocabulary through a sequence of greedy choices, whereas ConvexTok considers the entire vocabulary jointly when making decisions. The authors report improvements in tokenization metrics and in language-model bits-per-byte. The method also provides a certificate of optimality, which shows that its solutions are within 1% of optimal at common vocabulary sizes.
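For context on what "greedy" means here, the following is a minimal Python sketch of standard BPE training, in which each step commits to the single most frequent adjacent pair without revisiting earlier merges. It is not ConvexTok and does not reflect the paper's formulation; the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the toy corpus are illustrative assumptions.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if no pairs remain."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Greedy BPE: repeatedly merge the locally best pair, never backtracking."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Each merge is chosen from pair counts at that moment only, so an early choice can rule out a vocabulary that would have been better overall. That is the limitation the brief says ConvexTok addresses by considering the whole vocabulary jointly.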

IMPACT Introduces a convex-optimization approach to tokenizer construction that comes with an optimality guarantee and could improve language model efficiency and performance.
