New research explores speculative decoding for faster LLM inference

By the PulseAugur editorial team · [6 sources] · 2026-05-08 13:08

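Several research papers posted on arXiv explore advances in speculative decoding for large language models (LLMs). The work focuses on improving inference speed and efficiency by having a smaller "draft" model generate tokens that a larger "target" model then verifies. Techniques include building interpretable latency models for production systems, optimizing drafting policies with reinforcement learning, and modifying model architectures to prevent phenomena such as "attention drift". The research aims to improve accuracy and speedup across a range of benchmarks and model families.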
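The draft-then-verify idea can be illustrated with a minimal sketch of a single token decision. It assumes toy probability tables in place of real models. The function name `speculative_step` and the dict-based distributions are hypothetical and are not taken from any of the papers summarized here. The acceptance rule shown is the standard rejection-sampling rule from Leviathan et al. (accept a drafted token with probability min(1, p/q), otherwise resample from the renormalized residual max(0, p - q)), which keeps the output distributed according to the target model.

```python
import random


def speculative_step(draft_probs, target_probs, rng):
    """Decide one token with draft-then-verify rejection sampling.

    draft_probs, target_probs: dicts mapping token -> probability at one
    position (toy stand-ins for the draft and target models; assumed to be
    normalized over the same vocabulary).
    Returns (token, accepted), where accepted is True if the draft was kept.
    """
    tokens = list(draft_probs)
    # The draft model proposes a token sampled from its own distribution q.
    proposal = rng.choices(tokens, weights=[draft_probs[t] for t in tokens])[0]
    p = target_probs.get(proposal, 0.0)
    q = draft_probs[proposal]
    # The target model accepts the proposal with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return proposal, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by random.choices.
    vocab = list(target_probs)
    residual = [max(0.0, target_probs[t] - draft_probs.get(t, 0.0)) for t in vocab]
    return rng.choices(vocab, weights=residual)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    q = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical draft distribution
    p = {"a": 0.4, "b": 0.4, "c": 0.2}  # hypothetical target distribution
    samples = [speculative_step(q, p, rng) for _ in range(10000)]
    accept_rate = sum(acc for _, acc in samples) / len(samples)
    print(f"acceptance rate: {accept_rate:.2f}")
```

In a real system the draft model proposes a window of several tokens and the target model scores all of them in one forward pass. The per-token rule above is then applied left to right until the first rejection.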

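Impact: These papers introduce new techniques that could significantly accelerate LLM inference, which may make deploying large language models in production more efficient and less costly.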

Ranking rationale: Multiple academic papers posted on arXiv detail new methods and analyses for LLM speculative decoding.


AI-generated summary · Google Gemini · from 6 sources.

Sources [6]

arXiv cs.LG TIER_1 English(EN) · Alexandre Marques · 2026-05-14 16:45

An Interpretable Latency Model for Speculative Decoding in LLM Serving

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the…
arXiv cs.CL TIER_1 English(EN) · Xing Sun · 2026-05-14 15:41

Performance-Driven Policy Optimization for Adaptive-Window Speculative Decoding

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an ea…
arXiv cs.CL TIER_1 English(EN) · Zhou Yu · 2026-05-14 03:15

Discrete Diffusion Language Models Without Factorization Error via Speculative Decoding

Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Facto…
arXiv cs.CL TIER_1 English(EN) · Alexander Samarin · 2026-05-11 12:22

SlimSpec: A Low-Rank Draft LM Head for Accelerating Speculative Decoding

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern ar…
arXiv cs.AI TIER_1 English(EN) · Stephen Xia · 2026-05-11 05:08

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call attention drift: as the drafter gen…
arXiv cs.LG TIER_1 English(EN) · Hao Zhang · 2026-05-08 13:08

Future Validity Is the Missing Statistic: From Impossibility to $Φ$-Estimation for Grammar-Faithful Speculative Decoding

Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejecti…

