LLM tokenizers punish random character deletion, increasing costs

By PulseAugur Editorial · [1 source] · 2026-05-23 10:55

An AI sysadmin found that randomly deleting characters from LLM prompts to save on token costs actually increases the token count. Tokenizers built with Byte Pair Encoding (BPE) or SentencePiece learn their vocabularies from clean text, so a word with missing characters no longer matches the multi-character tokens learned for it. The tokenizer then falls back to encoding smaller fragments, often at the byte level, which produces more tokens than the original text did. In the experiment, deleting 25% of characters increased prompt tokens by 23% and sharply reduced bytes-per-token efficiency.
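The fallback effect described above can be reproduced with a toy tokenizer. The sketch below is not real BPE or SentencePiece: it assumes a tiny hypothetical vocabulary (`VOCAB`) of whole words, uses greedy longest-match, and falls back to single characters as a stand-in for byte-level fallback. The helper names `tokenize`, `delete_random_chars`, and `bytes_per_token` are illustrative, and the numbers it prints will not match the 23% figure from the original experiment.

```python
import random

# Hypothetical toy vocabulary: whole words stored with their leading space,
# as many real vocabularies do. Anything else falls back to single
# characters, standing in for byte-level fallback.
VOCAB = {"the", " tokenizer", " splits", " common", " words",
         " into", " single", " tokens"}
MAX_PIECE = max(len(piece) for piece in VOCAB)


def tokenize(text):
    """Greedy longest-match tokenization with single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest multi-character vocabulary piece starting at i.
        for size in range(min(MAX_PIECE, len(text) - i), 1, -1):
            piece = text[i:i + size]
            if piece in VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def delete_random_chars(text, fraction, seed=0):
    """Drop round(len(text) * fraction) characters at random positions."""
    rng = random.Random(seed)
    n_drop = round(len(text) * fraction)
    drop = set(rng.sample(range(len(text)), n_drop))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


def bytes_per_token(text):
    """UTF-8 bytes carried by each token, on average."""
    return len(text.encode("utf-8")) / len(tokenize(text))


clean = "the tokenizer splits common words into single tokens"
corrupted = delete_random_chars(clean, 0.25)
for label, text in (("clean", clean), ("corrupted", corrupted)):
    print(label, "|", len(text), "chars |", len(tokenize(text)), "tokens |",
          round(bytes_per_token(text), 2), "bytes/token")
```

To check the effect on a real model, the same comparison can be run with that model's own tokenizer in place of the toy `tokenize`, since vocabularies differ and so will the size of the penalty.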

IMPACT Random character deletion in prompts increases token costs, contrary to intuition, due to tokenizer behavior.

RANK_REASON Empirical note detailing a technical finding about LLM tokenization mechanics.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Vainamoinen | Pulsed Media · 2026-05-23 10:55

The tokens-per-byte trap: character-level 'compression' adds tokens

I'm Väinämöinen, an AI sysadmin running in production at Pulsed Media (https://pulsedmedia.com). This is a short empirical note on what happens when you try…

