New methods accelerate LLM inference with speculative decoding

By PulseAugur Editorial · [60 sources] · 2025-05-12 00:00

Researchers have developed several new methods to accelerate large language model (LLM) inference through speculative decoding. AdaPLD improves retrieval and draft construction by using semantic similarity and branched hypotheses, achieving up to 3.10x speedup. SSSD combines n-gram matching with hardware-aware speculation for up to 2.9x latency reduction without training. D^2SD uses dual diffusion draft models and confidence-guided prefix trees to raise acceptance rates, while TAPS optimizes prefix tree selection for diffusion-drafted decoding, yielding up to 7.9x speedup. KnapSpec frames layer selection for self-speculative decoding as a knapsack problem to maximize throughput, achieving up to 1.47x speedup, and Vegas uses verification-guided sparse attention to improve decoding throughput. Additionally, LK Losses directly optimize the acceptance rate during training, leading to gains of 8-10% in average acceptance length.

IMPACT With reported speedups ranging from 1.47x to 7.9x, these speculative decoding methods could cut LLM inference latency and cost and make deployment more accessible.

RANK_REASON Multiple research papers published on arXiv detailing new methods for speculative decoding in LLMs.

AI-generated summary · Google Gemini · from 60 sources.

COVERAGE [60]

arXiv cs.AI TIER_1 English(EN) · Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee · 2026-06-12 04:00

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

arXiv:2509.18085v4 Announce Type: replace-cross Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spi…
arXiv cs.AI TIER_1 English(EN) · Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang · 2026-06-11 04:00

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

arXiv:2606.12243v1 Announce Type: cross Abstract: Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or ful…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 15:45

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected token…
arXiv cs.AI TIER_1 English(EN) · Yi Yang · 2026-06-10 15:45

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected token…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

VIA-SD introduces a multi-tier speculative decoding framework that uses intra-model routing to reduce verification costs by employing slim submodels for medium-confidence token validation, achieving significant speedups over traditional approaches.
arXiv cs.AI TIER_1 English(EN) · Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou · 2026-06-09 04:00

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

arXiv:2602.05774v4 Announce Type: replace-cross Abstract: Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled dra…
arXiv cs.AI TIER_1 English(EN) · Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris · 2026-06-09 04:00

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

arXiv:2606.07710v1 Announce Type: cross Abstract: The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on…
arXiv cs.CL TIER_1 English(EN) · Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang · 2026-06-05 04:00

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

arXiv:2606.05742v1 Announce Type: new Abstract: Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and mo…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 06:09

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, …
arXiv cs.LG TIER_1 English(EN) · Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan · 2026-06-04 04:00

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

arXiv:2606.04446v1 Announce Type: cross Abstract: Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of token…
arXiv cs.AI TIER_1 English(EN) · Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli · 2026-06-04 04:00

SSSD: Simply-Scalable Speculative Decoding

arXiv:2411.05894v3 Announce Type: replace-cross Abstract: Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achi…
arXiv cs.AI TIER_1 English(EN) · Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han · 2026-06-03 04:00

KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

arXiv:2602.20217v2 Announce Type: replace-cross Abstract: Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attent…
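
To make the knapsack framing concrete, here is a generic 0/1 knapsack solver applied to a layer-selection toy problem. It is only a sketch under stated assumptions: the per-layer costs, the usefulness scores, and the name `select_layers` are hypothetical, and the abstract above is truncated, so this should not be read as KnapSpec's actual objective or algorithm.

```python
from typing import List, Tuple


def select_layers(costs: List[int], values: List[float], budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: pick layer indices maximizing total value within a cost budget."""
    n = len(costs)
    # best[i][b] = max value using the first i layers with total cost <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v
    # Walk back through the table to recover the chosen layers.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return best[n][budget], sorted(chosen)


# Hypothetical per-layer costs (integer time units) and usefulness scores.
layer_costs = [3, 4, 2, 5]
layer_values = [4.0, 5.0, 3.0, 7.0]
total, kept = select_layers(layer_costs, layer_values, budget=7)
```

In a self-speculative setting the kept layers would form the draft path and the skipped ones would run only during verification; how costs and values are measured is the part a real method has to supply.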
arXiv cs.LG TIER_1 English(EN) · Peer Rheinboldt, Frédéric Berdoz, Roger Wattenhofer · 2026-06-03 04:00

TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

arXiv:2606.03819v1 Announce Type: new Abstract: One-shot block drafters for speculative decoding generate the full draft in a single forward pass, achieving strong throughput by eliminating sequential token generation. However, they predict each draft token conditioned only on th…
arXiv cs.LG TIER_1 English(EN) · Roger Wattenhofer · 2026-06-02 16:00

TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

One-shot block drafters for speculative decoding generate the full draft in a single forward pass, achieving strong throughput by eliminating sequential token generation. However, they predict each draft token conditioned only on the prefix context, with no dependence on previous…
arXiv cs.CL TIER_1 English(EN) · Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai · 2026-06-02 04:00

Cost-Aware Diffusion Draft Trees for Speculative Decoding

arXiv:2606.01813v1 Announce Type: new Abstract: Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yieldin…
arXiv cs.AI TIER_1 English(EN) · Zhuoyu Wang, Junnan Huang, Xinyu Chen · 2026-06-02 04:00

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

arXiv:2606.00487v1 Announce Type: new Abstract: Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. Ho…
arXiv cs.AI TIER_1 English(EN) · Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang · 2026-06-02 04:00

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

arXiv:2606.00144v1 Announce Type: cross Abstract: Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU …
arXiv cs.AI TIER_1 English(EN) · Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard · 2026-06-02 04:00

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

arXiv:2606.01019v1 Announce Type: cross Abstract: Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target…
arXiv cs.AI TIER_1 English(EN) · Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low · 2026-06-02 04:00

MineDraft: A Framework for Batch Parallel Speculative Decoding

arXiv:2603.18016v2 Announce Type: replace-cross Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD i…
arXiv cs.CL TIER_1 English(EN) · Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li · 2026-06-02 04:00

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

arXiv:2606.02091v1 Announce Type: new Abstract: Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable dra…
arXiv cs.CL TIER_1 English(EN) · Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev · 2026-06-02 04:00

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

arXiv:2602.23881v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is sig…
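
Why the acceptance rate drives the speedup can be shown with two standard quantities from speculative sampling: the chance that a drafted token is accepted, and the expected number of tokens emitted per target forward pass. The sketch below computes both for toy distributions. It does not implement the LK losses themselves, the function names are hypothetical, and the second formula assumes each draft position is accepted independently with the same probability.

```python
from typing import Sequence


def acceptance_rate(p: Sequence[float], q: Sequence[float]) -> float:
    """Probability that a token drawn from draft distribution q is accepted
    against target distribution p under standard speculative sampling:
    sum over the vocabulary of min(p(x), q(x))."""
    if len(p) != len(q):
        raise ValueError("distributions must share a vocabulary")
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass when each of `gamma`
    draft tokens is accepted independently with probability `alpha`."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


target = [0.5, 0.3, 0.2]   # toy target distribution
drafter = [0.4, 0.4, 0.2]  # toy draft distribution
alpha = acceptance_rate(target, drafter)  # 0.4 + 0.3 + 0.2
tokens = expected_tokens_per_step(alpha, gamma=4)
```

Because the expected token count grows quickly as the acceptance rate approaches 1, even modest improvements in how well the drafter matches the target can translate into a noticeably longer accepted run.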
arXiv cs.LG TIER_1 English(EN) · Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang · 2026-06-02 04:00

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

arXiv:2606.00535v1 Announce Type: new Abstract: Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored. We…
arXiv cs.LG TIER_1 English(EN) · Yikang Yue, Yuqi Xue, Jian Huang · 2026-06-02 04:00

Vegas: Self-Speculative Decoding with Verification-Guided Sparse Attention

arXiv:2602.07223v2 Announce Type: replace Abstract: Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-specul…
arXiv cs.CL TIER_1 English(EN) · Sujian Li · 2026-06-01 11:18

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target…
arXiv cs.CL TIER_1 English(EN) · Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei · 2026-06-01 04:00

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

arXiv:2605.30852v1 Announce Type: new Abstract: Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty a…
arXiv cs.CL TIER_1 English(EN) · Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer · 2026-06-01 04:00

Speculative Decoding Across Languages

arXiv:2605.30580v1 Announce Type: new Abstract: Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disp…
arXiv cs.AI TIER_1 English(EN) · Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman · 2026-05-29 04:00

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

arXiv:2604.09557v2 Announce Type: replace-cross Abstract: Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that dive…
arXiv cs.CL TIER_1 English(EN) · Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li · 2026-05-29 04:00

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

arXiv:2604.13519v2 Announce Type: replace Abstract: Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, m…
arXiv cs.CL TIER_1 English(EN) · Jian Chen, Yesheng Liang, Zhijian Liu · 2026-05-29 04:00

DFlash: Block Diffusion for Flash Speculative Decoding

arXiv:2602.06036v2 Announce Type: replace Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by usi…
arXiv cs.CL TIER_1 English(EN) · Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang · 2026-05-29 04:00

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

arXiv:2605.29707v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost:…
arXiv cs.LG TIER_1 English(EN) · Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun · 2026-05-29 04:00

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

arXiv:2605.29727v1 Announce Type: new Abstract: Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled fro…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Speculative Pipeline Decoding introduces a novel framework that leverages pipeline parallelism to accelerate large language model inference by enabling parallel token processing and reducing decoding latency.
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 10:21

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully cond…
arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang · 2026-05-28 10:07

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependenci…
arXiv cs.AI TIER_1 English(EN) · Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee · 2026-05-28 04:00

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

arXiv:2510.02329v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft …
arXiv cs.AI TIER_1 English(EN) · Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang · 2026-05-28 04:00

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

arXiv:2605.27390v1 Announce Type: cross Abstract: Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively re…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and t…
arXiv cs.CL TIER_1 English(EN) · Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu · 2026-05-27 04:00

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

arXiv:2512.11280v2 Announce Type: replace Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a sm…
arXiv cs.CL TIER_1 English(EN) · Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma · 2026-05-27 04:00

MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

arXiv:2605.26444v1 Announce Type: new Abstract: Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary prunin…
arXiv cs.AI TIER_1 English(EN) · Avinash Kumar, Sujay Sanghavi, Poulami Das · 2026-05-27 04:00

HiSpec: Hierarchical Speculative Decoding for LLMs

arXiv:2510.01336v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token …
arXiv cs.CL TIER_1 English(EN) · Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum · 2026-05-26 04:00

Beyond the Target: From Imitation to Collaboration in Speculative Decoding

arXiv:2605.24793v1 Announce Type: new Abstract: Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the…
arXiv cs.CL TIER_1 English(EN) · Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou · 2026-05-22 04:00

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

arXiv:2605.07243v2 Announce Type: replace Abstract: Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as…
arXiv cs.AI TIER_1 English(EN) · Cong Wang · 2026-05-19 16:55

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottl…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 15:48

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …
arXiv cs.CL TIER_1 English(EN) · Linfeng Zhang · 2026-05-19 15:48

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …
X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-06-06 01:53

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches. https://t.co/q9h9IZU3mG
arXiv cs.CV TIER_1 English(EN) · Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen · 2026-06-03 04:00

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

arXiv:2603.18599v2 Announce Type: replace Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex …
arXiv cs.CV TIER_1 English(EN) · Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian · 2026-05-29 04:00

Multi-Scale Local Speculative Decoding for Image Generation

arXiv:2601.05149v2 Announce Type: replace Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing a…
X — MiniMax AI TIER_1 Dansk(DA) · MiniMax_AI · 2026-06-03 23:34

15.6× faster decoding at 1M tokens 🔥 Thanks @FireworksAI_HQ for powering the inference behind M3. Try it now 👇
Together AI blog TIER_1 English(EN) · 2026-04-24 00:00

Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.
Together AI blog TIER_1 English(EN) · 2025-05-12 00:00

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding
MarkTechPost TIER_1 English(EN) · Michal Sutter · 2026-05-27 07:23

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

The EAGLE team, vLLM, and TorchSpec jointly release EAGLE 3.1 to fix speculative decoding instability in production. https://www.marktechpost.com/2026/05/27/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference/
Mastodon — fosstodon.org TIER_1 Deutsch(DE) · [email protected] · 2026-06-12 04:02

RT @akshay_pachaar: Researchers have found a way to speed up LLMs by 8.5x! (without compromising accuracy) Speculative Decoding is a highly effective method for addressing the single-token bottleneck in conventional LLM inference. A small…
dev.to — LLM tag TIER_1 English(EN) · byeongsoo kang · 2026-06-11 07:26

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

In my MTP post (https://bric.pe.kr/blog/qwen3-27b-rtx-3090-llama-cpp-mtp-doubling-tokens), speculative decoding roughly doubled Qwen3.6-27B generation on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a…
r/LocalLLaMA TIER_1 English(EN) · /u/bigattichouse · 2026-06-09 01:50

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

https://www.reddit.com/r/LocalLLaMA/comments/1u0rk0o/2x_tks_from_194_381_tks_on_1_x_mi50_playing_with/
dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:07

KL-Threshold Routing Between LLMs: What Speculative Decoding Already Solved

In late 2023 I started a paper called "Mixture-of-Experts: KL-Divergence Threshold". The setup: run the small LLM by default, periodically check its next-token distribution against a larger reference model by computing KL divergence, fall back to the large model when th…
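
The routing rule described in this excerpt can be sketched in a few lines. The excerpt is truncated, so the details here are assumptions: the divergence is taken as KL(large || small), the fallback fires when it exceeds a fixed threshold, and the names `kl_divergence` and `route` are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two next-token distributions over one vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def route(small: Sequence[float], large: Sequence[float], threshold: float) -> str:
    """Return which model's distribution to sample from at this step.

    Assumption: fall back to the large model when KL(large || small) exceeds
    the threshold; the original post may use a different direction or schedule.
    """
    return "large" if kl_divergence(large, small) > threshold else "small"


agree = route([0.5, 0.5], [0.5, 0.5], threshold=0.1)
disagree = route([0.9, 0.1], [0.1, 0.9], threshold=0.1)
```

Note that computing the divergence requires running the large model anyway, which is the overlap with speculative decoding that the post's title points at.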
r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-06 12:16

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

https://www.reddit.com/r/LocalLLaMA/comments/1tyfqmp/domino_decoupling_causal_modeling_from/
dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-05 02:15

Speculative decoding: when and why it actually speeds up inference

Your chat endpoint serves 200 requests per second. The model is a 70B Llama 3 fine-tune. The GPU is sitting at 78% utilization, but the user-facing latency is still bad: 380 ms to first token on the…
Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-27 07:57

EAGLE 3.1 fixes attention drift in speculative decoding - using a small draft model to propose tokens verified by a larger target model to speed up LLM inference. The update adds FC normalisation and post-norm hidden states, delivering up to 2x longer acceptance length in long-co…

LINKS marktechpost.com/…/meet-eagle-3-1-the-spe…
dev.to — LLM tag TIER_1 English(EN) · Ken W Alger · 2026-05-22 16:25

The Speculative Decoding Pattern

Pattern Defined. Precise Definition: Speculative Decoding is an optimization pattern where a smaller, "draft" model predicts multiple upcoming tokens in parallel, which are then verified or corrected by a larger "oracle" model in a single…
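
The draft-then-verify pattern defined here can be sketched with toy models. This is a minimal illustration with greedy acceptance, assuming a hypothetical `NextToken` interface in which a model is just a function from a token list to its next token. A real system verifies all drafted positions in one batched forward pass of the larger model; the sketch calls it per position only for readability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (hypothetical interface)


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step with greedy acceptance.

    The draft proposes k tokens; the target keeps the longest prefix it agrees
    with, then contributes one token of its own (a correction or a bonus token).
    """
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    accepted: List[int] = []
    for tok in proposed:
        # Stand-in for reading position-wise outputs of a single target pass.
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all accepted: bonus token
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, n_new: int) -> List[int]:
    """Repeat speculative steps until n_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_new]


# Toy models: the target counts upward; the draft disagrees after multiples of 4.
target_model: NextToken = lambda seq: seq[-1] + 1
draft_model: NextToken = lambda seq: seq[-1] + (2 if seq[-1] % 4 == 0 else 1)

result = generate([0], draft_model, target_model, k=3, n_new=8)
```

With greedy acceptance the output matches what the target model alone would have produced; the draft only changes how many target passes are needed to get there.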
