ENTITY byte-pair encoding

PulseAugur coverage of byte-pair encoding — every cluster mentioning byte-pair encoding across labs, papers, and developer communities, ranked by signal.

Total: 19 over 90d
Releases: 0 over 90d
Papers: 15 over 90d

TIER MIX · 90D

research 9
tool 9
commentary 1

SENTIMENT · 30D

9 days with sentiment data

RECENT · PAGE 1/1 · 19 TOTAL

RESEARCH · CL_111600 · Jun 25 · 13:31

MinGram tokenizer simplifies training, boosts compression and alignment

Researchers have introduced MinGram, a new minimalist unigram tokenizer designed to simplify the training process while maintaining high compression and morphological alignment. MinGram achieves this by using a BPE-deri…
COMMENTARY · CL_106973 · Jun 23 · 17:13

LLMs struggle with letter counting due to tokenization, not poor spelling

Large language models struggle with tasks like counting letters or rhyming because their input is processed by a tokenizer, typically using Byte Pair Encoding (BPE), which converts text into integer token IDs. This proc…
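To illustrate the point, here is a minimal sketch of why letters become invisible after tokenization. The vocabulary and its IDs are hypothetical, and greedy longest-match segmentation is used as a simplification of real BPE merge order.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"str": 0, "aw": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
         "a": 6, "w": 7, "b": 8, "e": 9, "y": 10}


def encode(word):
    """Greedy longest-match segmentation (a simplification, not true BPE)."""
    ids, i = [], 0
    while i < len(word):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                ids.append(VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids


ids = encode("strawberry")
print(ids)                       # [0, 1, 2]
print("strawberry".count("r"))   # 3
```

The model receives only `[0, 1, 2]`. Counting the letter "r" is trivial on the string, but nothing in the three integers states how many times "r" occurs inside each token.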
RESEARCH · CL_107826 · Jun 22 · 21:04

New benchmark QuechuaTok highlights tokenization limits for agglutinative languages

A new benchmark called QuechuaTok has been developed to evaluate tokenization strategies for agglutinative, low-resource languages. Standard metrics like fertility rate are insufficient, so QuechuaTok introduces morphol…
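For context, fertility is commonly computed as the average number of tokens produced per word. The sketch below assumes that definition; `chunk3` is a stand-in tokenizer invented for illustration, and the word list is arbitrary rather than taken from the benchmark.

```python
def fertility(words, tokenize):
    """Average number of tokens per word (assumed definition of fertility)."""
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def chunk3(word):
    """Stand-in tokenizer: fixed 3-character pieces, not a trained model."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


words = ["token", "tokenization", "is", "hard"]
print(fertility(words, chunk3))  # 2.25
```

A score like this only counts pieces. It says nothing about whether the piece boundaries fall on morpheme boundaries, which is the kind of gap a morphology-aware benchmark would need to measure separately.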
TOOL · CL_106192 · Jun 20 · 08:35

minbpe vs turboBPE: Faster LLM Tokenizer Training Explained

The article compares two Python libraries for training Byte Pair Encoding (BPE) tokenizers, which are essential for large language models such as Llama and Mistral. minbpe, developed by Andrej Karpathy, is presented as an excelle…
RESEARCH · CL_99595 · Jun 18 · 11:10

New IHUBERT model advances Persian language understanding with curated pretraining

Researchers have developed IHUBERT, a new Persian language model built on the RoBERTa-base encoder. This model was trained on a 45 GB curated dataset from the Sepahr-Danesh collection, totaling approximately 7-8 billion…
RESEARCH · CL_99667 · Jun 17 · 22:06

New framework TOTEN improves tokenization of technical notation

Researchers have developed TOTEN, a knowledge-based ontological tokenization framework designed to improve the semantic understanding of technical notation in Brazilian Portuguese. Unlike traditional byte-pair encoding,…
TOOL · CL_95561 · Jun 17 · 01:10

minbpe vs turboBPE: Faster BPE tokenization for LLMs

Two distinct implementations of the Byte-Pair Encoding (BPE) tokenizer algorithm are compared: minbpe, a pure Python educational tool, and turboBPE, a significantly faster C-extension based implementation. While minbpe …
TOOL · CL_76751 · Jun 7 · 23:55

Byte Pair Encoding explained: Building LLM tokenization from scratch

This article explains Byte Pair Encoding (BPE), a crucial tokenization technique for Large Language Models (LLMs). BPE addresses the limitations of word-level tokenization (Out-Of-Vocabulary words) and character-level t…
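As a rough illustration of the technique, the following is a minimal sketch of byte-level BPE training: start from raw bytes and repeatedly replace the most frequent adjacent pair with a new token ID. The function names `train_bpe` and `merge` are chosen here for illustration and are not taken from the article.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Toy trainer: returns the learned merges and the encoded text."""
    ids = list(text.encode("utf-8"))   # base vocabulary: byte values 0..255
    merges = {}                        # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                  # nothing repeats, so merging gains nothing
            break
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", 3)
print(ids)  # [258, 100, 258, 97, 99]
```

Eleven input bytes shrink to five tokens after three merges. Because the base vocabulary covers every byte, any input can still be encoded, which is how this scheme avoids out-of-vocabulary failures.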
RESEARCH · CL_76045 · Jun 7 · 00:53

LLMs Explained: From Data to Text Generation

This article provides a detailed explanation of how Large Language Models (LLMs) function, breaking down the complex pipeline involved in their operation. It covers the essential stages from data preparation and tokeniz…
TOOL · CL_69108 · Jun 3 · 15:27

Researcher proposes semantic tokenization for language models

A researcher has proposed a novel tokenization scheme for language models where the token geometry itself reflects semantic relationships, moving beyond current methods that primarily capture statistical structure. This…
TOOL · CL_62858 · Jun 1 · 04:00

New BPE tokenization algorithm offers 3x speedup

Researchers have developed a new algorithm for incremental Byte Pair Encoding (BPE) tokenization, designed to improve efficiency in large language model pipelines. This method processes input bytes in logarithmic time, …
TOOL · CL_58840 · May 29 · 04:00

Kronecker Embeddings slash language model parameters, boost performance

Researchers have developed Kronecker Embeddings, a novel method for representing tokens in language models that significantly reduces the number of trainable parameters. This approach replaces large embedding tables wit…
TOOL · CL_45717 · May 23 · 10:55

LLM tokenizers punish random character deletion, increasing costs

An AI sysadmin discovered that randomly deleting characters from LLM prompts to save on token costs actually increases the token count. This occurs because tokenizers, like Byte Pair Encoding (BPE) and SentencePiece, ar…
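The effect can be reproduced with a toy model. This is a sketch under stated assumptions: the vocabulary is hypothetical, and greedy longest-match with a single-character fallback stands in for a real subword tokenizer.

```python
# Hypothetical multi-character tokens; anything else falls back to one character.
VOCAB = {"hello", " world", "please", " summarize"}


def tokenize(text):
    """Greedy longest-match with single-character fallback (toy model)."""
    tokens, i = [], 0
    while i < len(text):
        candidates = (v for v in VOCAB if text.startswith(v, i))
        match = max(candidates, key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens


print(len(tokenize("hello world")))  # 2
print(len(tokenize("helo world")))   # 5
```

Deleting one character makes the text shorter but the token sequence longer: "helo" no longer matches a learned entry, so it is split into four single-character tokens.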
RESEARCH · CL_43967 · May 21 · 17:59

New ConvexTok algorithm optimizes NLP tokenization using convex optimization

Researchers have developed a new tokenization algorithm called ConvexTok, which uses convex optimization to construct tokenizers. Unlike existing greedy methods like BPE and Unigram, ConvexTok considers the entire vocab…
RESEARCH · CL_43970 · May 21 · 16:46

New ToaST tokenizer cuts token counts by over 11%

Researchers have developed a new subword tokenization method called Tokenization with Split Trees (ToaST). This method optimizes compression by recursively splitting text into binary trees and selecting vocabulary based…
RESEARCH · CL_30772 · May 13 · 13:08

Paper analyzes how data representation impacts Transformer context

A new paper analyzes how different representations of data, such as bytes, characters, or subword tokens, affect the performance of Transformer models. The research introduces 'fragmentation' to explain why smaller unit…
TOOL · CL_15851 · May 5 · 04:00

New research shows model size scales with data bytes, not tokens, for optimal compute

A new paper explores the impact of token granularity on language model scaling laws. Researchers trained 988 models with varying parameter counts and compression rates to investigate how tokenization affects compute eff…
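To make the distinction between bytes and tokens concrete, the sketch below measures bytes per token for the same text at two granularities. It assumes compression rate means UTF-8 bytes per token, which may differ from the paper's exact definition, and `chunker` is a stand-in for tokenizers of different granularity.

```python
def bytes_per_token(text, tokenize):
    """UTF-8 bytes divided by token count (assumed meaning of compression rate)."""
    return len(text.encode("utf-8")) / len(tokenize(text))


def chunker(n):
    """Stand-in tokenizer factory: fixed n-character pieces."""
    return lambda text: [text[i:i + n] for i in range(0, len(text), n)]


text = "tokenization changes the unit of counting"
coarse, fine = chunker(4), chunker(2)

print(len(coarse(text)), len(fine(text)))
print(bytes_per_token(text, coarse), bytes_per_token(text, fine))
```

The byte count of the text is fixed, while the token count depends on the tokenizer. A data budget stated in tokens therefore describes different amounts of text for different tokenizers.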
RESEARCH · CL_14484 · Apr 27 · 10:49

New research boosts LLM edge inference speed and cross-model circuit transfer

Researchers have developed Peek2, a new pretokenizer for Byte-level BPE tokenizers that offers a significant speedup for LLM inference on edge devices. This drop-in replacement increases throughput by up to 2.48x in mic…
TOOL · CL_17378 · Apr 24 · 06:48

Interactive guide explains how large language models like ChatGPT are built

A new interactive visual guide, based on Andrej Karpathy's lecture, explains the intricate process of building large language models. It details the journey from collecting vast amounts of internet text to the final sta…
