byte-pair encoding

PulseAugur coverage of byte-pair encoding: every cluster mentioning byte-pair encoding across labs, papers, and developer communities, ranked by signal.

Total · 30d: 19 (19 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 15 (15 over 90d)
SENTIMENT · 30D: 9 days with sentiment data

RECENT · PAGE 1/1 · 19 TOTAL
  1. RESEARCH · CL_111600

    MinGram tokenizer simplifies training, boosts compression and alignment

    Researchers have introduced MinGram, a new minimalist unigram tokenizer designed to simplify the training process while maintaining high compression and morphological alignment. MinGram achieves this by using a BPE-deri…

  2. COMMENTARY · CL_106973

    LLMs struggle with letter counting due to tokenization, not poor spelling

    Large language models struggle with tasks like counting letters or rhyming because their input is processed by a tokenizer, typically using Byte Pair Encoding (BPE), which converts text into integer token IDs. This proc…

  3. RESEARCH · CL_107826

    New benchmark QuechuaTok highlights tokenization limits for agglutinative languages

    A new benchmark called QuechuaTok has been developed to evaluate tokenization strategies for agglutinative, low-resource languages. Standard metrics like fertility rate are insufficient, so QuechuaTok introduces morphol…

  4. TOOL · CL_106192

    minbpe vs turboBPE: Faster LLM Tokenizer Training Explained

    The article compares two Python libraries for training Byte Pair Encoding (BPE) tokenizers, essential for large language models like Llama and Mistral AI. minbpe, developed by Andrej Karpathy, is presented as an excelle…

  5. RESEARCH · CL_99595

    New IHUBERT model advances Persian language understanding with curated pretraining

    Researchers have developed IHUBERT, a new Persian language model built on the RoBERTa-base encoder. This model was trained on a 45 GB curated dataset from the Sepahr-Danesh collection, totaling approximately 7-8 billion…

  6. RESEARCH · CL_99667

    New framework TOTEN improves tokenization of technical notation

    Researchers have developed TOTEN, a knowledge-based ontological tokenization framework designed to improve the semantic understanding of technical notation in Brazilian Portuguese. Unlike traditional byte-pair encoding,…

  7. TOOL · CL_95561

    minbpe vs turboBPE: Faster BPE tokenization for LLMs

    Two distinct implementations of the Byte-Pair Encoding (BPE) tokenizer algorithm are compared: minbpe, a pure Python educational tool, and turboBPE, a significantly faster C-extension based implementation. While minbpe …

  8. TOOL · CL_76751

    Byte Pair Encoding explained: Building LLM tokenization from scratch

    This article explains Byte Pair Encoding (BPE), a crucial tokenization technique for Large Language Models (LLMs). BPE addresses the limitations of word-level tokenization (Out-Of-Vocabulary words) and character-level t…

  9. RESEARCH · CL_76045

    LLMs Explained: From Data to Text Generation

    This article provides a detailed explanation of how Large Language Models (LLMs) function, breaking down the complex pipeline involved in their operation. It covers the essential stages from data preparation and tokeniz…

  10. TOOL · CL_69108

    Researcher proposes semantic tokenization for language models

    A researcher has proposed a novel tokenization scheme for language models where the token geometry itself reflects semantic relationships, moving beyond current methods that primarily capture statistical structure. This…

  11. TOOL · CL_62858

    New BPE tokenization algorithm offers 3x speedup

    Researchers have developed a new algorithm for incremental Byte Pair Encoding (BPE) tokenization, designed to improve efficiency in large language model pipelines. This method processes input bytes in logarithmic time, …

  12. TOOL · CL_58840

    Kronecker Embeddings slash language model parameters, boost performance

    Researchers have developed Kronecker Embeddings, a novel method for representing tokens in language models that significantly reduces the number of trainable parameters. This approach replaces large embedding tables wit…

  13. TOOL · CL_45717

    LLM tokenizers punish random character deletion, increasing costs

    An AI sysadmin discovered that randomly deleting characters from LLM prompts to save on token costs actually increases the token count. This occurs because tokenizers, like Byte Pair Encoding (BPE) and SentencePiece, ar…

  14. RESEARCH · CL_43967

    New ConvexTok algorithm optimizes NLP tokenization using convex optimization

    Researchers have developed a new tokenization algorithm called ConvexTok, which uses convex optimization to construct tokenizers. Unlike existing greedy methods like BPE and Unigram, ConvexTok considers the entire vocab…

  15. RESEARCH · CL_43970

    New ToaST tokenizer cuts token counts by over 11%

    Researchers have developed a new subword tokenization method called Tokenization with Split Trees (ToaST). This method optimizes compression by recursively splitting text into binary trees and selecting vocabulary based…

  16. RESEARCH · CL_30772

    Paper analyzes how data representation impacts Transformer context

    A new paper analyzes how different representations of data, such as bytes, characters, or subword tokens, affect the performance of Transformer models. The research introduces 'fragmentation' to explain why smaller unit…

  17. TOOL · CL_15851

    New research shows model size scales with data bytes, not tokens, for optimal compute

    A new paper explores the impact of token granularity on language model scaling laws. Researchers trained 988 models with varying parameter counts and compression rates to investigate how tokenization affects compute eff…

  18. RESEARCH · CL_14484

    New research boosts LLM edge inference speed and cross-model circuit transfer

    Researchers have developed Peek2, a new pretokenizer for Byte-level BPE tokenizers that offers a significant speedup for LLM inference on edge devices. This drop-in replacement increases throughput by up to 2.48x in mic…

  19. TOOL · CL_17378

    Interactive guide explains how large language models like ChatGPT are built

    A new interactive visual guide, based on Andrej Karpathy's lecture, explains the intricate process of building large language models. It details the journey from collecting vast amounts of internet text to the final sta…
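
Several entries above describe BPE as a process that converts text into integer token IDs by repeatedly merging frequent pairs. The following is a minimal sketch of that idea, assuming plain byte-level BPE with no pretokenization, special tokens, or performance tuning. The function names (`train_bpe`, `merge`, `encode`, `decode`) are illustrative only and are not taken from minbpe, turboBPE, or any of the listed articles.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the raw UTF-8 bytes (IDs 0-255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in training order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges would not compress
            break
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Turn text into token IDs by replaying the merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token IDs back to text by expanding each merged ID into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    sample = "low lower lowest low low"
    merges = train_bpe(sample, num_merges=5)
    token_ids = encode(sample, merges)
    print(token_ids)
    print(decode(token_ids, merges) == sample)
```

In this sketch a frequent word ends up represented by a few merged IDs rather than one ID per letter, which is one way to see why letter-level tasks can be awkward for a model that only receives token IDs.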