PulseAugur

Graft and FlexDraft speed up LLMs with new speculative decoding methods

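Two new research papers, Graft and FlexDraft, introduce speculative decoding techniques that accelerate large language model inference. Graft combines pruning with retrieval, using retrieved candidates to fill the gaps left by pruned branches, and achieves significant speedups without any training. FlexDraft uses attention tuning and bonus-guided calibration to adapt flexibly to different batch sizes, mitigating the draft-verification mismatch and improving throughput. Both methods aim to overcome the latency-cost trap in LLM deployment by delivering high-quality responses at speeds close to those of small models.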

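Impact: These advances in speculative decoding could significantly reduce the latency and cost of LLM inference, enabling faster and more efficient deployment of AI applications.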

Ranking rationale: Two research papers introduce novel speculative decoding techniques for LLMs.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 13 sources. How we write summaries →


Sources [13]

  1. arXiv cs.CL TIER_1 English(EN) · Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma ·

    MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

    arXiv:2605.26444v1 Announce Type: new Abstract: Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary prunin…

  2. arXiv cs.CL TIER_1 English(EN) · Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu ·

    AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

    arXiv:2512.11280v2 Announce Type: replace Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a sm…

  3. arXiv cs.AI TIER_1 English(EN) · Avinash Kumar, Sujay Sanghavi, Poulami Das ·

    HiSpec: Hierarchical Speculative Decoding for LLMs

    arXiv:2510.01336v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token …

  4. arXiv cs.CL TIER_1 English(EN) · Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum ·

    Beyond the Target: From Imitation to Collaboration in Speculative Decoding

    arXiv:2605.24793v1 Announce Type: new Abstract: Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the…

  5. arXiv cs.CL TIER_1 English(EN) · Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou ·

    SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

    arXiv:2605.07243v2 Announce Type: replace Abstract: Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as…

  6. arXiv cs.AI TIER_1 English(EN) · Cong Wang ·

    Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

    Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottl…

  7. arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang ·

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …

  9. Together AI blog TIER_1 English(EN) ·

    Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

    Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.

  10. Together AI blog TIER_1 English(EN) ·

    Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

  11. MarkTechPost TIER_1 English(EN) · Michal Sutter ·

    Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

    The EAGLE team, vLLM, and TorchSpec jointly release EAGLE 3.1 to fix speculative decoding instability in production. The post: https://www.marktechpost.com/2026/05/27/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference/ …

  12. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inferenc

    EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inference. The update adds FC normalisation and post-norm hidden states, delivering up to 2x longer acceptance length in long-co…

  13. dev.to — LLM tag TIER_1 English(EN) · Ken W Alger ·

    The Speculative Decoding Pattern

    Pattern Defined. Precise Definition: Speculative Decoding is an optimization pattern where a smaller, "draft" model predicts multiple upcoming tokens in parallel, which are then verified or corrected by a larger "oracle" model in a single…
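
The draft-then-verify loop that the sources above describe can be sketched in a few lines. This is a minimal illustration of generic speculative decoding under greedy decoding, not the method of Graft, FlexDraft, or any other listed paper. The names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical, and the "models" are toy functions that map a token prefix to the next token. A real system would verify all drafted positions in one batched forward pass of the target model; the loop here only mimics the outcome of that pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to the single
# next token it would choose under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly produced tokens."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify: compare each proposal with the target model's own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens by repeating speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration only): the target counts
# upward modulo 10, and the draft agrees except right after token 2.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 9 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, max_new=12))
```

Because every emitted token is either confirmed or supplied by the target model, the output matches what the target would produce on its own under greedy decoding. The gain comes only from producing several tokens per target verification round when the draft model guesses well.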