Brief

last 24h

[3/3] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · dev.to — LLM tag English(EN) · 3d

The tokens-per-byte trap: character-level 'compression' adds tokens

An AI sysadmin found that randomly deleting characters from LLM prompts to cut token costs actually increases the token count. Tokenizers such as Byte Pair Encoding (BPE) and SentencePiece learn their vocabularies from clean text, so corrupted words no longer match the long, frequent vocabulary entries. The tokenizer then falls back to smaller fragments, often down to the byte level, and emits more tokens than the original text required. In the reported experiment, deleting 25% of characters increased prompt tokens by 23% and sharply reduced bytes-per-token efficiency.

IMPACT Random character deletion in prompts increases token costs, contrary to intuition, due to tokenizer behavior.
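The fallback effect can be reproduced with a toy tokenizer. The sketch below is illustrative only: `tokenize`, `delete_every_kth`, and `VOCAB` are hypothetical names, the vocabulary is hand-written rather than learned, and deletion is every fourth character instead of random so the result is repeatable. It is not the experiment from the article, and a whole-word vocabulary exaggerates the effect compared with a real BPE or SentencePiece model.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer with single-character fallback (toy model)."""
    max_len = max(len(entry) for entry in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def delete_every_kth(text, k):
    """Deterministic stand-in for random deletion: drop every k-th character."""
    return "".join(ch for idx, ch in enumerate(text, start=1) if idx % k != 0)


def bytes_per_token(text, tokens):
    return len(text.encode("utf-8")) / len(tokens)


# Hypothetical vocabulary: whole words, as if learned from clean text.
VOCAB = {"prompt", " tokens", " stay", " cheap", " when",
         " words", " match", " the", " vocabulary"}

text = "prompt tokens stay cheap when words match the vocabulary"
clean = tokenize(text, VOCAB)

damaged_text = delete_every_kth(text, 4)  # removes 25% of the characters
damaged = tokenize(damaged_text, VOCAB)

print(len(text), len(clean), round(bytes_per_token(text, clean), 2))
print(len(damaged_text), len(damaged), round(bytes_per_token(damaged_text, damaged), 2))
```

The damaged string is shorter, yet none of its fragments match a vocabulary entry, so every character becomes its own token. For the reported figures, and assuming bytes shrink in proportion to characters, bytes per token would fall to about 0.75 / 1.23 ≈ 0.61 of the original value.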
RESEARCH · arXiv cs.CL English(EN) · 5d · [2 sources]

Tokenisation via Convex Relaxations

Researchers have developed ConvexTok, a tokenization algorithm that constructs tokenizers using convex optimization. Unlike greedy methods such as BPE and Unigram, ConvexTok considers the entire vocabulary when making decisions rather than committing to one local choice at a time. It reports improvements in tokenization metrics and in language-model bits-per-byte, and it provides a certificate of optimality: at common vocabulary sizes, its solutions are within 1% of optimal.

IMPACT Introduces a tokenization approach with a certificate of near-optimality that could improve language model efficiency and performance.
- ConvexTok
- arXiv
- Unigram
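For contrast, here is a minimal sketch of the greedy baseline the summary refers to: a BPE training loop that commits to the single most frequent adjacent pair at each step and never revisits the choice. This is not ConvexTok. The function name `train_bpe`, the tie-breaking rule, and the sample words are assumptions made for illustration.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Greedy BPE training: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Local decision: highest count wins; ties broken by the pair itself
        # (an arbitrary but deterministic rule chosen for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(corpus)
```

Each merge depends only on the current pair counts, which is the step-by-step behavior that a global formulation over the whole vocabulary is meant to avoid.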
RESEARCH · arXiv cs.CL English(EN) · 5d · [2 sources]

Tokenization with Split Trees

Researchers have developed a subword tokenization method called Tokenization with Split Trees (ToaST). It optimizes compression by recursively splitting text into binary trees and selecting the vocabulary through a relaxation of an integer program. ToaST has demonstrated an 11% reduction in token counts compared with existing methods such as BPE and WordPiece, along with improved performance when training 1.5B-parameter language models.

IMPACT This new tokenization method could lead to more efficient language models by reducing token counts and extending effective context length.
