The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty
A new research paper analyzes the "tokenizer tax," the hidden cost of non-English natural language processing due to how words are broken into tokens. The study measured token fertility across 25 European languages for ten foundation models, revealing significant variations. Greek and Maltese exhibit the highest fertility, requiring over three tokens per word, while English uses just over one. AI
IMPACT Highlights inefficiencies in current NLP models for non-English languages, potentially driving development of more equitable tokenization strategies.