PulseAugur
research · 6 sources

New research explores speculative decoding for faster LLM inference

Multiple research papers published on arXiv explore advances in speculative decoding for Large Language Models (LLMs). These studies focus on improving inference speed and efficiency by using a smaller "draft" model to propose tokens that a larger "target" model then verifies in parallel. Techniques include interpretable latency models for production serving systems, drafter policies optimized with reinforcement learning, and architectural changes that counter phenomena such as "attention drift." Collectively, the work aims to improve both accuracy and speedup across a range of benchmarks and model families.

Summary written by gemini-2.5-flash-lite from 6 sources.
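The draft-and-verify loop all six papers build on fits in a few lines. Below is a minimal greedy-verification sketch; draft_model and target_model are hypothetical callables returning next-token logits, not an interface from any of the papers:

```python
import torch

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One draft-and-verify step (greedy variant).

    draft_model / target_model map a (1, seq_len) token tensor to
    (1, seq_len, vocab) logits; both names are illustrative only.
    """
    # 1) Draft: the small model proposes k tokens autoregressively.
    tokens = list(prefix)
    proposed = []
    for _ in range(k):
        logits = draft_model(torch.tensor([tokens]))[0, -1]
        tok = int(logits.argmax())
        proposed.append(tok)
        tokens.append(tok)

    # 2) Verify: a single parallel forward pass of the large model
    #    scores every drafted position at once.
    target_logits = target_model(torch.tensor([tokens]))[0]

    # 3) Accept the longest prefix of proposals the target agrees with;
    #    the first disagreement is replaced by the target's own token.
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(target_logits[len(prefix) + i - 1].argmax())
        if target_tok != tok:
            accepted.append(target_tok)
            break
        accepted.append(tok)
    else:
        # All k proposals accepted: take one bonus token from the target.
        accepted.append(int(target_logits[-1].argmax()))
    return accepted
```

Every accepted token costs one cheap draft pass instead of one expensive target pass, which is where the speedup comes from.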

IMPACT These papers introduce novel techniques to significantly accelerate LLM inference, potentially leading to more efficient and cost-effective deployment of large language models in production environments.

RANK_REASON Multiple academic papers published on arXiv detailing new methods and analyses for speculative decoding in LLMs.

Read on arXiv cs.LG →

COVERAGE [6]

  1. arXiv cs.LG TIER_1 · Alexandre Marques

    An Interpretable Latency Model for Speculative Decoding in LLM Serving

    Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the…
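
    The abstract is truncated, but the standard first-order latency model for SD (a textbook baseline that a serving-aware model like this paper's would refine; the form below is an assumption, not the paper's result) ties window length, per-pass costs, and acceptance rate together:

    ```latex
    % k       draft window length
    % t_d     cost of one draft forward pass (per token)
    % t_v     cost of one parallel target verification pass
    % \alpha  per-token acceptance probability, assumed i.i.d.
    \mathbb{E}[\text{accepted}] = \sum_{i=1}^{k} \alpha^{i}
                                = \frac{\alpha\,(1-\alpha^{k})}{1-\alpha},
    \qquad
    T_{\text{per output token}} \approx
        \frac{k\,t_d + t_v}{\mathbb{E}[\text{accepted}] + 1}.
    ```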

  2. arXiv cs.CL TIER_1 · Xing Sun

    Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

    Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an ea…
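
    The excerpt cuts off before the policy details; as a reference point, a simple acceptance-rate heuristic for adapting the window is sketched below. The paper learns its policy with reinforcement learning rather than a hand-tuned rule like this, and every name here is illustrative:

    ```python
    class AdaptiveWindow:
        """Heuristic adaptive draft window (illustrative only)."""

        def __init__(self, k_min=1, k_max=8, target_accept=0.7, ema=0.9):
            self.k = k_max
            self.k_min, self.k_max = k_min, k_max
            self.target_accept = target_accept
            self.ema = ema
            self.accept_rate = target_accept

        def update(self, n_proposed: int, n_accepted: int) -> int:
            # Track a smoothed per-token acceptance rate.
            rate = n_accepted / max(n_proposed, 1)
            self.accept_rate = (self.ema * self.accept_rate
                                + (1 - self.ema) * rate)
            # Draft further ahead while acceptance is high; shrink the
            # window at hard-to-draft positions where drafts keep failing.
            if self.accept_rate > self.target_accept:
                self.k = min(self.k + 1, self.k_max)
            else:
                self.k = max(self.k - 1, self.k_min)
            return self.k
    ```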

  3. arXiv cs.CL TIER_1 · Zhou Yu

    Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Facto…
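
    The factorization error the abstract refers to can be stated in one line (restating the claim; the notation is assumed):

    ```latex
    % Token-wise X_0 prediction samples positions independently:
    p_\theta(x_0 \mid x_t) \;=\; \prod_{i} p_\theta\!\big(x_0^{(i)} \mid x_t\big)
    \;\neq\; p\big(x_0 \mid x_t\big)
    % Equality holds only if the clean tokens are conditionally
    % independent given x_t; predicting dependent tokens in parallel
    % is exactly what introduces the error.
    ```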

  4. arXiv cs.CL TIER_1 · Alexander Samarin

    SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern ar…
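
    An LM head is a hidden-to-vocabulary projection of size V × d, which dominates a small drafter's parameter count at large vocabularies. A generic low-rank factorization in the spirit of the title (SlimSpec's actual construction is not shown in the excerpt):

    ```python
    import torch
    import torch.nn as nn

    class LowRankLMHead(nn.Module):
        """LM head factored as W ≈ A @ B, cutting parameters and FLOPs
        from V*d to r*(V + d). Generic sketch, not SlimSpec itself."""

        def __init__(self, hidden_dim: int, vocab_size: int, rank: int):
            super().__init__()
            self.down = nn.Linear(hidden_dim, rank, bias=False)  # B: d -> r
            self.up = nn.Linear(rank, vocab_size, bias=False)    # A: r -> V

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            # (..., hidden_dim) -> (..., vocab_size)
            return self.up(self.down(hidden))

    # With hidden_dim=4096, vocab_size=128000, rank=256:
    # full head  ~ 4096 * 128000          ≈ 524M parameters
    # low-rank   ~ 256 * (4096 + 128000)  ≈  34M parameters
    ```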

  5. arXiv cs.AI TIER_1 · Stephen Xia

    Attention Drift: What Autoregressive Speculative Decoding Models Learn

    Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call "attention drift": as the drafter gen…

  6. arXiv cs.LG TIER_1 · Hao Zhang

    Future Validity is the Missing Statistic: From Impossibility to $Φ$-Estimation for Grammar-Faithful Speculative Decoding

    Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejecti…
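
    The "Leviathan rejecti…" the excerpt cuts off at is presumably the standard speculative-sampling acceptance rule of Leviathan et al., shown below for reference without the grammar masking this paper analyzes:

    ```python
    import numpy as np

    def leviathan_accept(p_target, q_draft, drafted_token, rng):
        """Accept a drafted token with prob min(1, p/q); on rejection,
        resample from the normalized residual max(0, p - q).

        p_target, q_draft: next-token distributions (1-D arrays over the
        vocabulary); drafted_token was sampled from q_draft, so its draft
        probability is nonzero.
        """
        p = p_target[drafted_token]
        q = q_draft[drafted_token]
        if rng.random() < min(1.0, p / q):
            return drafted_token, True
        residual = np.maximum(p_target - q_draft, 0.0)
        residual /= residual.sum()
        return int(rng.choice(len(residual), p=residual)), False

    # This rule preserves the target distribution exactly; the paper's
    # point is that adding per-step grammar masks breaks that guarantee.
    ```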