byte-pair encoding
PulseAugur coverage of byte-pair encoding — every cluster mentioning byte-pair encoding across labs, papers, and developer communities, ranked by signal.
-
MinGram tokenizer simplifies training, boosts compression and alignment
Researchers have introduced MinGram, a new minimalist unigram tokenizer designed to simplify the training process while maintaining high compression and morphological alignment. MinGram achieves this by using a BPE-deri…
-
LLMs struggle with letter counting due to tokenization, not poor spelling
Large language models struggle with tasks like counting letters or rhyming because their input is processed by a tokenizer, typically using Byte Pair Encoding (BPE), which converts text into integer token IDs. This proc…
-
New benchmark QuechuaTok highlights tokenization limits for agglutinative languages
A new benchmark called QuechuaTok has been developed to evaluate tokenization strategies for agglutinative, low-resource languages. Standard metrics like fertility rate are insufficient, so QuechuaTok introduces morphol…
-
minbpe vs turboBPE: Faster LLM Tokenizer Training Explained
The article compares two Python libraries for training Byte Pair Encoding (BPE) tokenizers, which are essential for large language models like Llama and Mistral. minbpe, developed by Andrej Karpathy, is presented as an excelle…
-
New IHUBERT model advances Persian language understanding with curated pretraining
Researchers have developed IHUBERT, a new Persian language model built on the RoBERTa-base encoder. This model was trained on a 45 GB curated dataset from the Sepahr-Danesh collection, totaling approximately 7-8 billion…
-
New framework TOTEN improves tokenization of technical notation
Researchers have developed TOTEN, a knowledge-based ontological tokenization framework designed to improve the semantic understanding of technical notation in Brazilian Portuguese. Unlike traditional byte-pair encoding,…
-
minbpe vs turboBPE: Faster BPE tokenization for LLMs
Two implementations of the Byte-Pair Encoding (BPE) tokenizer algorithm are compared: minbpe, a pure Python educational tool, and turboBPE, a significantly faster implementation based on a C extension. While minbpe …
-
Byte Pair Encoding explained: Building LLM tokenization from scratch
This article explains Byte Pair Encoding (BPE), a crucial tokenization technique for Large Language Models (LLMs). BPE addresses the limitations of word-level tokenization (Out-Of-Vocabulary words) and character-level t…
-
LLMs Explained: From Data to Text Generation
This article provides a detailed explanation of how Large Language Models (LLMs) function, breaking down the complex pipeline involved in their operation. It covers the essential stages from data preparation and tokeniz…
-
Researcher proposes semantic tokenization for language models
A researcher has proposed a novel tokenization scheme for language models where the token geometry itself reflects semantic relationships, moving beyond current methods that primarily capture statistical structure. This…
-
New BPE tokenization algorithm offers 3x speedup
Researchers have developed a new algorithm for incremental Byte Pair Encoding (BPE) tokenization, designed to improve efficiency in large language model pipelines. This method processes input bytes in logarithmic time, …
-
Kronecker Embeddings slash language model parameters, boost performance
Researchers have developed Kronecker Embeddings, a novel method for representing tokens in language models that significantly reduces the number of trainable parameters. This approach replaces large embedding tables wit…
-
LLM tokenizers punish random character deletion, increasing costs
An AI sysadmin discovered that randomly deleting characters from LLM prompts to save on token costs actually increases the token count. This occurs because tokenizers, such as those based on Byte Pair Encoding (BPE) or SentencePiece, ar…
-
New ConvexTok algorithm optimizes NLP tokenization using convex optimization
Researchers have developed a new tokenization algorithm called ConvexTok, which uses convex optimization to construct tokenizers. Unlike existing greedy methods like BPE and Unigram, ConvexTok considers the entire vocab…
-
New ToaST tokenizer cuts token counts by over 11%
Researchers have developed a new subword tokenization method called Tokenization with Split Trees (ToaST). This method optimizes compression by recursively splitting text into binary trees and selecting vocabulary based…
-
Paper analyzes how data representation impacts Transformer context
A new paper analyzes how different representations of data, such as bytes, characters, or subword tokens, affect the performance of Transformer models. The research introduces 'fragmentation' to explain why smaller unit…
-
New research shows model size scales with data bytes, not tokens, for optimal compute
A new paper explores the impact of token granularity on language model scaling laws. Researchers trained 988 models with varying parameter counts and compression rates to investigate how tokenization affects compute eff…
-
New research boosts LLM edge inference speed and cross-model circuit transfer
Researchers have developed Peek2, a new pretokenizer for Byte-level BPE tokenizers that offers a significant speedup for LLM inference on edge devices. This drop-in replacement increases throughput by up to 2.48x in mic…
-
Interactive guide explains how large language models like ChatGPT are built
A new interactive visual guide, based on Andrej Karpathy's lecture, explains the intricate process of building large language models. It details the journey from collecting vast amounts of internet text to the final sta…