When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

New methods speed up large language model inference by improving speculative decoding

By the PulseAugur editorial team · [5 sources] · 2026-04-29 08:25

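Researchers are developing new methods to accelerate large language model (LLM) inference, which is typically slowed by sequential decoding. Several recent papers explore speculative decoding, a technique in which a smaller "draft" model proposes tokens that a larger "target" model then verifies. The innovations include combining multi-draft and block verification strategies, using KV caches to obtain richer draft signals, and developing training-free methods that accept drafts that are semantically correct even when they are not exact matches. These methods aim to substantially increase decoding speed while preserving output quality and generalization across models and tasks.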
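The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of the basic greedy, exact-match variant only, not the method of any of the papers listed below. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token sequence to the next token.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: token sequence in, next token out.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(model: NextTokenFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one sequential model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(draft: NextTokenFn, target: NextTokenFn,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding with exact-match acceptance."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target checks each proposed position. A real system scores
        #    all k positions in one parallel forward pass; this toy loop
        #    calls the target once per position for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token matches: keep it
            else:
                accepted.append(expected)  # mismatch: use the target's token
                break                      # and discard the rest of the draft
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]


def toy_target(seq: List[Token]) -> Token:
    """Toy stand-in for the large model (assumption, not a real LLM)."""
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[Token]) -> Token:
    """Toy stand-in for the small model: disagrees at every fifth position."""
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess
```

Because every appended token is either confirmed or supplied by the target, the output of this exact-match variant is identical to plain greedy decoding with the target alone. The speedup in practice comes from verifying several draft tokens per target pass, which is why draft accuracy at later speculative steps matters.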

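Impact: New speculative decoding methods promise to substantially speed up LLM inference, lowering operating costs and enabling real-time applications.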

Ranking rationale: Several academic papers posted on arXiv introduce new speculative decoding techniques for LLM inference.


AI-generated summary · Google Gemini · based on 5 sources.

Sources [5]

arXiv cs.CL · Tier 1 · English · Yijun Lin, Jinhao Sheng, Qingyue Cai, Feng Zhou · 2026-04-30 04:00

SpecTr-GBV: Accelerating Speculative Decoding with Multi-Draft Block Verification

arXiv:2604.25925v1 (announce type: new). Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are select…

arXiv cs.CL · Tier 1 · English · Tianyu Liu, Yuhao Shen, Xinyi Hu, Baolin Zhang, Hengxin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, MingCheng Wan · 2026-04-30 04:00

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

arXiv:2604.26412v1 (announce type: new). Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mis…

arXiv cs.CL · Tier 1 · English · Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun · 2026-04-30 04:00

LogitSpec: Accelerating Retrieval-Based Speculative Decoding via Next-Next-Token Speculation

arXiv:2507.01449v3 (announce type: replace). Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many …

arXiv cs.CL · Tier 1 · English · Jinze Li, Yixing Xu, Guanchen Li, Shuo Yang, Jinfeng Xu, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum · 2026-04-30 04:00

Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

arXiv:2511.22972v3 (announce type: replace). Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate t…

arXiv cs.CL · Tier 1 · English · MingCheng Wan · 2026-04-29 08:25

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a…

