PulseAugur

Graft and FlexDraft boost LLM speed with new speculative decoding methods

Two new research papers, Graft and FlexDraft, introduce advanced techniques for speculative decoding to accelerate large language model inference. Graft combines pruning and retrieval to fill gaps left by pruned branches, achieving significant speedups without training. FlexDraft employs attention tuning and bonus-guided calibration to adapt flexibly across different batch sizes, mitigating draft verification mismatches and improving throughput. These methods aim to overcome the latency-cost trap in LLM deployment by allowing high-quality responses at speeds closer to smaller models.

IMPACT These advancements in speculative decoding could significantly reduce LLM inference latency and cost, enabling faster and more efficient deployment of AI applications.

RANK_REASON Two research papers introduce novel techniques for speculative decoding in LLMs.

AI-generated summary · Google Gemini · from 13 sources.

COVERAGE [13]

  1. arXiv cs.CL TIER_1 English(EN) · Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma ·

    MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

    arXiv:2605.26444v1. Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary prunin…

  2. arXiv cs.CL TIER_1 English(EN) · Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu ·

    AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

    arXiv:2512.11280v2. Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a sm…

  3. arXiv cs.AI TIER_1 English(EN) · Avinash Kumar, Sujay Sanghavi, Poulami Das ·

    HiSpec: Hierarchical Speculative Decoding for LLMs

    arXiv:2510.01336v2. Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token …

  4. arXiv cs.CL TIER_1 English(EN) · Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum ·

    Beyond the Target: From Imitation to Collaboration in Speculative Decoding

    arXiv:2605.24793v1. Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the…

  5. arXiv cs.CL TIER_1 English(EN) · Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou ·

    SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

    arXiv:2605.07243v2. Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as…

  6. arXiv cs.AI TIER_1 English(EN) · Cong Wang ·

    Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

    Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottl…

  7. arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang ·

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …

  9. Together AI blog TIER_1 English(EN) ·

    Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

    Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.

  10. Together AI blog TIER_1 English(EN) ·

    Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

  11. MarkTechPost TIER_1 English(EN) · Michal Sutter ·

    Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

    The EAGLE team, vLLM, and TorchSpec jointly release EAGLE 3.1 to fix speculative decoding instability in production.

  12. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inference. The update adds FC normalisation and post-norm hidden states, delivering up to 2x longer acceptance length in long-co…

  13. dev.to — LLM tag TIER_1 English(EN) · Ken W Alger ·

    The Speculative Decoding Pattern

    Pattern Defined. Precise Definition: Speculative Decoding is an optimization pattern where a smaller "draft" model predicts multiple upcoming tokens in parallel, which are then verified or corrected by a larger "oracle" model in a single…
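The draft-then-verify loop that the items above describe can be sketched in a few lines. This is a minimal greedy illustration, not the method of Graft, FlexDraft, or any other listed paper. The names `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical. The two "models" are toy functions over integer tokens, and verification is written as a plain loop where a real system would use one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical large model: a deterministic toy rule on the last token."""
    return (prefix[-1] * 7 + 3) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical small model: agrees with the target except when the
    prefix length is a multiple of 4, where it guesses wrong."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """One draft-then-verify round; returns between 1 and k + 1 new tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target checks each proposed position (looped here for clarity).
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: Model, target: Model,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Decode max_new tokens; also report how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def greedy(prompt: List[int], model: Model, n: int) -> List[int]:
    """Baseline: plain one-token-at-a-time decoding with a single model."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


tokens, rounds = generate([1], toy_draft, toy_target, k=4, max_new=20)
print(tokens == greedy([1], toy_target, 20), rounds)
```

Under greedy decoding the output matches what the target model would produce on its own, while the number of target verification rounds drops below the number of generated tokens whenever the draft guesses correctly. How much that helps in practice depends on the draft's acceptance rate and on the relative cost of drafting and verification, which is what the papers listed above try to improve.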