English(EN) FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

New methods accelerate large language model inference with speculative decoding

By the PulseAugur editorial team · [60 sources] · 2025-05-12 00:00

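Researchers have developed several new methods for accelerating large language model (LLM) inference through speculative decoding. AdaPLD improves retrieval and draft construction by using semantic similarity and branching hypotheses, achieving speedups of up to 3.10×. SSSD combines n-gram matching with hardware-aware speculation and cuts latency by up to 2.9× without any training. D^2SD uses dual diffusion models and a confidence-guided prefix tree to raise acceptance rates, while TAPS optimizes prefix-tree selection for diffusion-drafted decoding and reaches speedups of up to 7.9×. KnapSpec treats draft-model selection as a knapsack problem to maximize throughput, achieving speedups of up to 1.47×, and Vegas uses verification-guided sparse attention to improve decoding throughput. In addition, LK Losses optimize the acceptance rate directly during training, raising average acceptance length by 8-10%.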
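
The summary above quotes acceptance rates, acceptance lengths, and speedups. As a rough guide to how these quantities relate, the sketch below uses the standard simplified analysis of speculative decoding, under the assumption that every draft token is accepted independently with the same probability. The names `alpha`, `gamma`, and `cost_ratio` are illustrative and are not taken from any of the papers listed here.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    alpha: assumed per-token acceptance probability (i.i.d. simplification).
    gamma: number of tokens the drafter proposes per step.
    The count includes the one token the target model always contributes.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    cost_ratio: assumed cost of one draft step relative to one target step.
    Each speculative step costs gamma draft calls plus one target call.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        tokens = expected_tokens_per_step(alpha, gamma=4)
        speedup = expected_speedup(alpha, gamma=4, cost_ratio=0.05)
        print(f"alpha={alpha:.1f} tokens/step={tokens:.2f} speedup={speedup:.2f}")
```

Under these assumptions, a higher acceptance rate raises the number of tokens produced per verification pass, while a more expensive drafter erodes the gain. Real systems deviate from this model because acceptance is not independent across positions.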

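Impact: These advances in speculative decoding promise substantial speed and efficiency gains for LLM inference, which could lower costs and improve accessibility.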

Ranking rationale: Multiple research papers published on arXiv detail new speculative decoding methods for large language models.


AI-generated summary · Google Gemini · from 60 sources.

Coverage sources [60]

arXiv cs.AI TIER_1 English(EN) · Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee · 2026-06-12 04:00

Building the Future: Speculative Decoding for Diffusion LLMs via Calibrated Draft Graphs

arXiv:2509.18085v4 Announce Type: replace-cross Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spi…
arXiv cs.AI TIER_1 English(EN) · Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang · 2026-06-11 04:00

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

arXiv:2606.12243v1 Announce Type: cross Abstract: Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or ful…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 15:45

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected token…
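
The draft-then-verify loop described in this abstract, with its binary accept-or-recompute decision, can be sketched as follows. This is a minimal greedy version with toy stand-in models. It does not implement VIA-SD's intra-model routing, and `draft_next`, `target_next`, and `speculative_step` are hypothetical names chosen for illustration.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[int]:
    """Return the tokens committed by one draft-then-verify step."""
    # 1) The drafter proposes k tokens one after another.
    ctx, draft = list(prefix), []
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The target checks every position. A real system batches these checks
    #    into a single forward pass; a loop is used here for clarity.
    ctx, committed = list(prefix), []
    for token in draft:
        expected = target_next(ctx)
        if expected != token:
            committed.append(expected)  # first mismatch: keep the target's token
            return committed
        committed.append(token)
        ctx.append(token)
    committed.append(target_next(ctx))  # all accepted: one extra target token
    return committed


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, n_new: int, k: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_new]


# Toy models: the target counts upward modulo 10; the drafter errs after a 3.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], draft_next, target_next, n_new=12))
```

With greedy verification the output matches what the target model would produce on its own, and the drafter only changes how many target passes are needed.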
arXiv cs.AI TIER_1 English(EN) · Yi Yang · 2026-06-10 15:45

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected token…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

VIA-SD introduces a multi-tier speculative decoding framework that uses intra-model routing to reduce verification costs by employing slim submodels for medium-confidence token validation, achieving significant speedups over traditional approaches.
arXiv cs.AI TIER_1 English(EN) · Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou · 2026-06-09 04:00

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

arXiv:2602.05774v4 Announce Type: replace-cross Abstract: Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled dra…
arXiv cs.AI TIER_1 English(EN) · Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris · 2026-06-09 04:00

WhiFlash: Accelerating Speculative Decoding via Token-Level Cross-Paradigm Routing

arXiv:2606.07710v1 Announce Type: cross Abstract: The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on…
arXiv cs.CL TIER_1 English(EN) · Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang · 2026-06-05 04:00

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

arXiv:2606.05742v1 Announce Type: new Abstract: Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and mo…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 06:09

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, …
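
To illustrate what "reusing text already available during generation" can look like, here is a minimal sketch of generic n-gram lookup drafting, often called prompt-lookup drafting. It is not AdaPLD's adaptive retrieval, and the function name and parameters are assumptions made for this example.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram: int = 2,
                max_draft: int = 4) -> List[int]:
    """Draft tokens by copying what followed the most recent earlier
    occurrence of the last `ngram` tokens. Returns [] if nothing matches."""
    if len(tokens) <= ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan start positions right to left, skipping the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


if __name__ == "__main__":
    history = [5, 6, 7, 8, 9, 1, 5, 6]
    # The suffix (5, 6) appeared earlier, followed by 7, 8, 9.
    print(ngram_draft(history, ngram=2, max_draft=3))
```

The proposed tokens still have to be verified by the target model, so a poor match costs some wasted verification work but does not change the output.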
arXiv cs.LG TIER_1 English(EN) · Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan · 2026-06-04 04:00

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

arXiv:2606.04446v1 Announce Type: cross Abstract: Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of token…
arXiv cs.AI TIER_1 English(EN) · Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli · 2026-06-04 04:00

SSSD: Simply-Scalable Speculative Decoding

arXiv:2411.05894v3 Announce Type: replace-cross Abstract: Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achi…
arXiv cs.AI TIER_1 English(EN) · Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han · 2026-06-03 04:00

KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

arXiv:2602.20217v2 Announce Type: replace-cross Abstract: Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attent…
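
The abstract is truncated, so KnapSpec's actual formulation is not visible here. As a reminder of what a knapsack framing means in general, the sketch below solves a textbook 0/1 knapsack: pick a subset of items that maximizes total value within a cost budget. Mapping items to layers, and the specific values and costs, are purely hypothetical.

```python
from typing import List, Sequence, Tuple


def knapsack(values: Sequence[float], costs: Sequence[int],
             budget: int) -> Tuple[float, List[int]]:
    """Classic 0/1 knapsack via dynamic programming.

    Returns the best total value and the sorted indices of the chosen items.
    Costs and budget are integers (for example, discretized latency units).
    """
    n = len(values)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip item i-1
            if costs[i - 1] <= b:        # option 2: take it if it fits
                candidate = best[i - 1][b - costs[i - 1]] + values[i - 1]
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which items were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return best[n][budget], sorted(chosen)


if __name__ == "__main__":
    # Hypothetical per-layer "usefulness" scores and latency costs.
    total, picked = knapsack(values=[6, 10, 12], costs=[1, 2, 3], budget=5)
    print(total, picked)
```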
arXiv cs.LG TIER_1 English(EN) · Peer Rheinboldt, Frédéric Berdoz, Roger Wattenhofer · 2026-06-03 04:00

TreeFlash: Parallel AR Approximation for Faster Speculative Decoding

arXiv:2606.03819v1 Announce Type: new Abstract: One-shot block drafters for speculative decoding generate the full draft in a single forward pass, achieving strong throughput by eliminating sequential token generation. However, they predict each draft token conditioned only on th…
arXiv cs.LG TIER_1 English(EN) · Roger Wattenhofer · 2026-06-02 16:00

TreeFlash: Parallel AR Approximation for Faster Speculative Decoding

One-shot block drafters for speculative decoding generate the full draft in a single forward pass, achieving strong throughput by eliminating sequential token generation. However, they predict each draft token conditioned only on the prefix context, with no dependence on previous…
arXiv cs.CL TIER_1 English(EN) · Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai · 2026-06-02 04:00

Cost-Aware Diffusion Draft Trees for Speculative Decoding

arXiv:2606.01813v1 Announce Type: new Abstract: Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yieldin…
arXiv cs.AI TIER_1 English(EN) · Zhuoyu Wang, Junnan Huang, Xinyu Chen · 2026-06-02 04:00

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

arXiv:2606.00487v1 Announce Type: new Abstract: Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. Ho…
arXiv cs.AI TIER_1 English(EN) · Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang · 2026-06-02 04:00

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Inference

arXiv:2606.00144v1 Announce Type: cross Abstract: Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU …
arXiv cs.AI TIER_1 English(EN) · Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard · 2026-06-02 04:00

Hybrid Verification Decoding: Learning to Allocate Verification in Speculative Decoding

arXiv:2606.01019v1 Announce Type: cross Abstract: Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target…
arXiv cs.AI TIER_1 English(EN) · Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low · 2026-06-02 04:00

MineDraft: A Framework for Batch-Parallel Speculative Decoding

arXiv:2603.18016v2 Announce Type: replace-cross Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD i…
arXiv cs.CL TIER_1 English(EN) · Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li · 2026-06-02 04:00

DFlare: Scaling Draft Capacity for Block Diffusion Speculative Decoding

arXiv:2606.02091v1 Announce Type: new Abstract: Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable dra…
arXiv cs.CL TIER_1 English(EN) · Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev · 2026-06-02 04:00

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

arXiv:2602.23881v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is sig…
arXiv cs.LG TIER_1 English(EN) · Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang · 2026-06-02 04:00

DREAM-S: Speculative Decoding with Searchable Drafts and Target-Aware Refinement for Multimodal Generation

arXiv:2606.00535v1 Announce Type: new Abstract: Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs) however, its application to vision-language models (VLMs) remains relatively unexplored. We…
arXiv cs.LG TIER_1 English(EN) · Yikang Yue, Yuqi Xue, Jian Huang · 2026-06-02 04:00

Vegas: Self-Speculative Decoding with Verification-Guided Sparse Attention

arXiv:2602.07223v2 Announce Type: replace Abstract: Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-specul…
arXiv cs.CL TIER_1 English(EN) · Sujian Li · 2026-06-01 11:18

DFlare: Scaling Draft Capacity for Block Diffusion Speculative Decoding

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target…
arXiv cs.CL TIER_1 English(EN) · Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei · 2026-06-01 04:00

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

arXiv:2605.30852v1 Announce Type: new Abstract: Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty a…
arXiv cs.CL TIER_1 English(EN) · Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer · 2026-06-01 04:00

Speculative Decoding Across Languages

arXiv:2605.30580v1 Announce Type: new Abstract: Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disp…
arXiv cs.AI TIER_1 English(EN) · Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman · 2026-05-29 04:00

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

arXiv:2604.09557v2 Announce Type: replace-cross Abstract: Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that dive…
arXiv cs.CL TIER_1 English(EN) · Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li · 2026-05-29 04:00

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

arXiv:2604.13519v2 Announce Type: replace Abstract: Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, m…
arXiv cs.CL TIER_1 English(EN) · Jian Chen, Yesheng Liang, Zhijian Liu · 2026-05-29 04:00

DFlash: Block Diffusion for Flash Speculative Decoding

arXiv:2602.06036v2 Announce Type: replace Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by usi…
arXiv cs.CL TIER_1 English(EN) · Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang · 2026-05-29 04:00

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

arXiv:2605.29707v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost:…
arXiv cs.LG TIER_1 English(EN) · Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun · 2026-05-29 04:00

Bastion: Cost-Aware Speculative Decoding with Tree-Structured Block Diffusion Drafts

arXiv:2605.29727v1 Announce Type: new Abstract: Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled fro…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Speculative Pipeline Decoding introduces a novel framework that leverages pipeline parallelism to accelerate large language model inference by enabling parallel token processing and reducing decoding latency.
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 10:21

Bastion: Budget-Aware Speculative Decoding with Tree-Structured Block Diffusion Drafts

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully cond…
arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang · 2026-05-28 10:07

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependenci…
arXiv cs.AI TIER_1 English(EN) · Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee · 2026-05-28 04:00

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

arXiv:2510.02329v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft …
arXiv cs.AI TIER_1 English(EN) · Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang · 2026-05-28 04:00

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

arXiv:2605.27390v1 Announce Type: cross Abstract: Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively re…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and t…
arXiv cs.CL TIER_1 English(EN) · Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu · 2026-05-27 04:00

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

arXiv:2512.11280v2 Announce Type: replace Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a sm…
arXiv cs.CL TIER_1 English(EN) · Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma · 2026-05-27 04:00

MicroSpec: Accelerating Speculative Decoding with a Lightweight Contextual Vocabulary

arXiv:2605.26444v1 Announce Type: new Abstract: Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary prunin…
arXiv cs.AI TIER_1 English(EN) · Avinash Kumar, Sujay Sanghavi, Poulami Das · 2026-05-27 04:00

HiSpec: Hierarchical Speculative Decoding for LLMs

arXiv:2510.01336v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token …
arXiv cs.CL TIER_1 English(EN) · Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum · 2026-05-26 04:00

Beyond the Target: Speculative Decoding from Imitation to Collaboration

arXiv:2605.24793v1 Announce Type: new Abstract: Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the…
arXiv cs.CL TIER_1 English(EN) · Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou · 2026-05-22 04:00

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

arXiv:2605.07243v2 Announce Type: replace Abstract: Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as…
arXiv cs.AI TIER_1 English(EN) · Cong Wang · 2026-05-19 16:55

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottl…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 15:48

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …
arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang · 2026-05-19 15:48

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …
X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-06-06 01:53

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches. https://t.co/q9h9

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches. https://t.co/q9h9IZU3mG
arXiv cs.CV TIER_1 English(EN) · Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen · 2026-06-03 04:00

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

arXiv:2603.18599v2 Announce Type: replace Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex …
arXiv cs.CV TIER_1 English(EN) · Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian · 2026-05-29 04:00

Multi-Scale Local Speculative Decoding for Image Generation

arXiv:2601.05149v2 Announce Type: replace Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing a…
X — MiniMax AI TIER_1 Dansk(DA) · MiniMax_AI · 2026-06-03 23:34

15.6× faster decoding at 1M tokens 🔥

15.6× faster decoding at 1M tokens 🔥 Thanks @FireworksAI_HQ for powering the inference behind M3. Try it now 👇
Together AI blog TIER_1 English(EN) · 2026-04-24 00:00

Up to 50% faster RL rollouts with distribution-aware speculative decoding

Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.
Together AI blog TIER_1 English(EN) · 2025-05-12 00:00

Boosting DeepSeek-R1's speed with customized speculative decoding
MarkTechPost TIER_1 English(EN) · Michal Sutter · 2026-05-27 07:23

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

The EAGLE team, vLLM, and TorchSpec jointly release EAGLE 3.1 to fix speculative decoding instability in production. (https://www.marktechpost.com/2026/05/27/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference/)
Mastodon — fosstodon.org TIER_1 Deutsch(DE) · [email protected] · 2026-06-12 04:02

RT @akshay_pachaar: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is a…

RT @akshay_pachaar: Researchers have found a way to speed up LLMs by 8.5x! (without compromising on accuracy) Speculative decoding is a highly effective method for addressing the single-token bottleneck in conventional LLM inference. A sma…
dev.to — LLM tag TIER_1 English(EN) · byeongsoo kang · 2026-06-11 07:26

MTP doesn't always win: 1.95x on my 3090, but speculative decoding is hardware-dependent

In my MTP post (https://bric.pe.kr/blog/qwen3-27b-rtx-3090-llama-cpp-mtp-doubling-tokens), speculative decoding roughly doubled Qwen3.6-27B generation on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a…
r/LocalLLaMA TIER_1 English(EN) · /u/bigattichouse · 2026-06-09 01:50

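2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, using the fact that I can run multiple computations in parallel, as if I had Qwen3.6-27B loaded in memory twice; small quantized models don't use up all the available compute.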

https://www.reddit.com/r/LocalLLaMA/comments/1u0rk0o/2x_tks_from_194_381_tks_on_1_x_mi50_playing_with/
dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:07

KL-Threshold Routing Between LLMs: A Problem Speculative Decoding Already Solved

In late 2023 I started a paper called Mixture-of-Experts: KL-Divergence Threshold. The setup: run the small LLM by default, periodically check its next-token distribution against a larger reference model by computing KL divergence, fall back to the large model when th…
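
The routing rule described in this excerpt can be sketched as follows. The excerpt is cut off before it states the fallback condition, so the direction of the divergence (large model relative to small model), the threshold value, and the function names are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over one vocabulary.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def pick_model(small_probs: Sequence[float], large_probs: Sequence[float],
               threshold: float = 0.1) -> Tuple[str, float]:
    """Use the small model unless it diverges too far from the reference."""
    # Assumed direction: how badly the small model approximates the large one.
    kl = kl_divergence(large_probs, small_probs)
    return ("large" if kl > threshold else "small"), kl


if __name__ == "__main__":
    print(pick_model([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # models agree
    print(pick_model([0.1, 0.9], [0.9, 0.1]))            # models disagree
```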
r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-06 12:16

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

https://www.reddit.com/r/LocalLLaMA/comments/1tyfqmp/domino_decoupling_causal_modeling_from/
dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-05 02:15

Speculative decoding: when and why it actually speeds up inference

Your chat endpoint serves 200 requests per second. The model is a 70B Llama 3 fine-tune. The GPU is sitting at 78% utilization, but the user-facing latency is still bad: 380 ms to first token on the…
Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-27 07:57

EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inference

EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inference. The update adds FC normalisation and post-norm hidden states, delivering up to 2x longer acceptance length in long-co…

Link: marktechpost.com/…/meet-eagle-3-1-the-spe…
dev.to — LLM tag TIER_1 English(EN) · Ken W Alger · 2026-05-22 16:25

The Speculative Decoding Pattern

Pattern Defined. Precise Definition: Speculative Decoding is an optimization pattern where a smaller, "draft" model predicts multiple upcoming tokens in parallel, which are then verified or corrected by a larger "oracle" model in a single…
