PulseAugur

New research explores speculative decoding for faster LLM inference

Several research papers published on arXiv examine speculative decoding for large language models (LLMs). The technique speeds up inference by having a smaller "draft" model propose tokens that a larger "target" model then verifies. The papers cover interpretable latency models for production serving, drafter policies optimized with reinforcement learning, and architectural changes that prevent phenomena such as "attention drift." The stated goal is higher accuracy and larger speedups across a range of benchmarks and model families.
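The draft-and-verify loop described above can be sketched in a few lines. This is a minimal greedy illustration, not the method of any paper listed here: the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, a "model" is assumed to be a plain function from a token prefix to its next token, and verification is written as a loop where a real serving system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: the target model generates one token per call."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, window: int = 4) -> List[int]:
    """Greedy draft-and-verify; returns the same tokens as greedy_decode."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes up to `window` tokens autoregressively.
        k = min(window, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each drafted position in order. Its own token is
        #    always kept, and checking stops at the first disagreement.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break
        tokens.extend(accepted)
    return tokens


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small model; wrong at every third position."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 7
```

Calling `speculative_decode(toy_target, toy_draft, [1], 10)` yields the same sequence as `greedy_decode(toy_target, [1], 10)`. Under greedy verification the draft model affects only how many tokens are accepted per round, never which tokens are produced.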

Impact: These papers introduce techniques to accelerate LLM inference, which could make production deployment of large language models more efficient and cost-effective.

Ranking rationale: Multiple academic papers published on arXiv detail new methods and analyses for speculative decoding in LLMs.


AI-generated summary · Google Gemini · from 6 sources.


Sources [6]

  1. arXiv cs.LG TIER_1 English(EN) · Alexandre Marques

    An Interpretable Latency Model for Speculative Decoding in LLM Serving

    Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the…

  2. arXiv cs.CL TIER_1 English(EN) · Xing Sun

    Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

    Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an ea…

  3. arXiv cs.CL TIER_1 English(EN) · Zhou Yu

    Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Facto…

  4. arXiv cs.CL TIER_1 English(EN) · Alexander Samarin

    SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern ar…

  5. arXiv cs.AI TIER_1 English(EN) · Stephen Xia

    Attention Drift: What Autoregressive Speculative Decoding Models Learn

    Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call "attention drift": as the drafter gen…

  6. arXiv cs.LG TIER_1 English(EN) · Hao Zhang

    Future Validity is the Missing Statistic: From Impossibility to $Φ$-Estimation for Grammar-Faithful Speculative Decoding

    Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejecti…
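As background for the sampling-based setting that several of these abstracts refer to, the sketch below shows the standard speculative sampling acceptance rule: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution, and on rejection a replacement is drawn from the normalized residual max(0, p - q). This is a general illustration under stated assumptions, not code from any of the listed papers. The function names are hypothetical, and distributions are assumed to be plain Python lists over a small vocabulary.

```python
import random
from typing import List, Tuple


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p and q are identical the residual is empty; a rejection cannot
    # happen in that case, so falling back to p is only a safeguard.
    return [r / total for r in raw] if total > 0 else list(p)


def verify_token(x: int, p: List[float], q: List[float],
                 rng: random.Random) -> Tuple[bool, int]:
    """Accept drafted token x (sampled from q) or replace it.

    Returns (accepted, token). Assumes q[x] > 0, which holds whenever
    x was actually sampled from q.
    """
    accept_prob = min(1.0, p[x] / q[x])
    if rng.random() < accept_prob:
        return True, x
    replacement = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return False, replacement
```

With p = [0.5, 0.5] and q = [0.9, 0.1], token 1 is always accepted because the target assigns it more mass than the draft does, while token 0 is accepted about 56% of the time and otherwise replaced by token 1. Averaged over draws from q, the emitted token follows p.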