PulseAugur

Perplexity AI open-sources Unigram tokenizer for 5x speedup

Perplexity AI has open-sourced a Unigram tokenizer that it rebuilt to reduce CPU cost. At production input lengths, the encoder cuts p50 latency by roughly 5x compared to HuggingFace tokenizers and 2x compared to SentencePiece C++. The work targets XLM-RoBERTa's 250K-token Unigram vocabulary, which is commonly used in ranking and retrieval tasks.

IMPACT Accelerates inference for AI models by reducing tokenization latency on CPUs.

RANK_REASON Open-sourcing of a performance-optimized component for an AI product.

AI-generated summary · Google Gemini · from 4 sources.

COVERAGE [4]

  1. X — Perplexity TIER_1 English(EN) · perplexity_ai ·

    Read more about improving Unigram tokenizer CPU performance on our blog: https://t.co/8E95gOXP1g

  2. X — Perplexity TIER_1 English(EN) · perplexity_ai ·

    At production input lengths, the encoder cuts p50 latency by roughly 5× vs. HuggingFace tokenizers, 2× vs. SentencePiece C++, and 1.5× vs. IREE C. At 514 tokens, it runs in 63 µs with zero heap allocations. https://t.co/PBg08lAXc8

  3. X — Perplexity TIER_1 English(EN) · perplexity_ai ·

    The work targets XLM-RoBERTa’s 250K-token Unigram vocabulary, commonly used for ranking and retrieval. The encoder produces the same tokens as the reference implementation, but avoids rebuilding strings and chasing hash maps while deciding how text should be split. https://t.co/…

  4. X — Perplexity TIER_1 English(EN) · perplexity_ai ·

    We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x. Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency. https://t.co/QUnHeiho56 https://t.co/Oh29f1lo51
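
For readers unfamiliar with how a Unigram tokenizer decides "how text should be split" (item 3 above), the following is a minimal sketch of the general technique: a Viterbi search for the highest-scoring segmentation, with candidate pieces found by walking a trie over character positions instead of slicing out every substring and looking it up. This is not Perplexity's implementation. The function names, the toy vocabulary, and its log-probabilities are all hypothetical, and the trie uses Python dicts for brevity, whereas a production encoder would presumably use a flatter layout.

```python
import math


def build_trie(vocab):
    """Build a character trie from a {piece: log_probability} mapping.

    The vocabulary and its scores are toy values chosen for illustration.
    """
    root = {"children": {}, "score": None}
    for piece, score in vocab.items():
        node = root
        for ch in piece:
            node = node["children"].setdefault(
                ch, {"children": {}, "score": None}
            )
        node["score"] = score
    return root


def unigram_segment(text, trie):
    """Return (pieces, total_score) for the best-scoring segmentation.

    best[i] is the best total log-probability of any segmentation of
    text[:i]; back[i] is the start index of the last piece in it.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0.0

    for start in range(n):
        if best[start] == -math.inf:
            continue  # no valid segmentation reaches this position
        node = trie
        # Walk the trie one character at a time; no substrings are built
        # during the search, only indices are tracked.
        for end in range(start, n):
            node = node["children"].get(text[end])
            if node is None:
                break
            if node["score"] is not None:
                candidate = best[start] + node["score"]
                if candidate > best[end + 1]:
                    best[end + 1] = candidate
                    back[end + 1] = start

    if n > 0 and back[n] == -1:
        # Real tokenizers fall back to an unknown-token piece here;
        # this sketch simply reports the failure.
        raise ValueError("text cannot be segmented with this vocabulary")

    # Recover the pieces by following back-pointers from the end.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    pieces.reverse()
    return pieces, best[n]


# Toy vocabulary (hypothetical scores).
VOCAB = {
    "unigram": -4.0, "un": -2.0, "i": -3.0, "gram": -2.5,
    "u": -5.0, "n": -5.0, "g": -5.0, "r": -5.0, "a": -5.0, "m": -5.0,
}
TRIE = build_trie(VOCAB)

print(unigram_segment("unigram", TRIE))  # (['unigram'], -4.0)
print(unigram_segment("ungram", TRIE))   # (['un', 'gram'], -4.5)
```

With these toy scores, the single piece "unigram" (-4.0) beats "un" + "i" + "gram" (-7.5), which is why the search returns the whole word as one piece.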