PulseAugur
research · [7 sources]

New techniques like UniVer and SpecKV boost LLM inference speed via speculative decoding

Researchers have developed new methods to accelerate large language model (LLM) inference. UniVer offers a unified treatment of multi-step and multi-draft speculative decoding, improving acceptance length by up to 8.5%. Speculative Speculative Decoding (SSD) parallelizes verification and speculation, with an optimized algorithm called Saguaro achieving up to a 5x speedup over autoregressive decoding. Additionally, SpecKV introduces an adaptive controller that dynamically selects the speculation length from model-compression and draft-model signals, yielding a 56.0% improvement over fixed-length speculation.
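The draft-then-verify loop common to all three papers can be sketched in a few lines. The fixed toy distributions below stand in for real draft and target models, and are assumptions for illustration only; the acceptance rule (accept with probability min(1, p/q), resample from the normalized residual max(0, p − q) on rejection) is the standard speculative-sampling scheme these papers build on.

```python
import random

random.seed(42)

VOCAB = ["a", "b", "c", "d"]
# Hypothetical stand-ins for two LLMs' next-token distributions.
DRAFT  = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # small draft model q
TARGET = {"a": 0.5, "b": 0.1, "c": 0.2, "d": 0.2}   # large target model p

def sample(dist):
    toks, probs = zip(*dist.items())
    return random.choices(toks, weights=probs, k=1)[0]

def residual(p, q):
    # Normalized max(0, p - q): the correction distribution used on rejection.
    r = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(r.values())
    return {t: v / z for t, v in r.items()}

def speculative_step(gamma=4):
    """One draft-then-verify round; returns the tokens kept this round."""
    drafted = [sample(DRAFT) for _ in range(gamma)]
    accepted = []
    for tok in drafted:
        # Accepting with prob min(1, p/q) keeps the output distributed
        # exactly as if sampled from the target model alone.
        if random.random() < min(1.0, TARGET[tok] / DRAFT[tok]):
            accepted.append(tok)
        else:
            accepted.append(sample(residual(TARGET, DRAFT)))
            break  # everything drafted after a rejection is discarded
    return accepted

out = speculative_step()
print(out)  # between 1 and gamma tokens per round
```

Each round costs one (batched) target-model verification, so the speedup grows with how many drafted tokens survive verification — which is exactly the acceptance length UniVer optimizes.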

Summary written by gemini-2.5-flash-lite from 7 sources. How we write summaries →

IMPACT New speculative decoding techniques promise significant speedups in LLM inference, potentially reducing computational costs and latency.

RANK_REASON Multiple arXiv papers introduce novel techniques for accelerating LLM inference.

Read on arXiv cs.LG →

COVERAGE [7]

  1. arXiv cs.LG TIER_1 · Yepeng Weng, Qiao Hu, Takehisa Yairi ·

    UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

    arXiv:2605.04543v1 Announce Type: cross Abstract: Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolat…

  2. arXiv cs.CL TIER_1 · Takehisa Yairi ·

    UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

    Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation, applying either flat OT to single-step drafts…

  3. arXiv cs.LG TIER_1 · Tanishq Kumar, Tri Dao, Avner May ·

    Speculative Speculative Decoding

    arXiv:2603.03251v3 Announce Type: replace Abstract: Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then…

  4. arXiv cs.CL TIER_1 · Shikhar Shukla ·

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    arXiv:2605.02888v1 Announce Type: cross Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation lengt…

  5. arXiv cs.CL TIER_1 · Shikhar Shukla ·

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $\gamma$, which determines how many tokens the draft …

  6. arXiv cs.CL TIER_1 · Shikhar Shukla ·

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $\gamma$, which determines how many tokens the draft …

  7. arXiv cs.LG TIER_1 · Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao ·

    Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

    arXiv:2604.21952v1 Announce Type: new Abstract: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computatio…
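The speculation-length selection described in the SpecKV entries above can be sketched as a small feedback controller. SpecKV's actual controller conditions on model-compression and draft-model signals; the acceptance-rate heuristic, thresholds, and class below are illustrative assumptions, not the published algorithm.

```python
class AdaptiveGamma:
    """Minimal sketch of an adaptive speculation-length controller.

    Grows gamma when drafts are mostly accepted (verification is cheap
    relative to the tokens gained) and shrinks it when rejections waste
    draft work. All thresholds here are hypothetical.
    """

    def __init__(self, gamma=4, lo=1, hi=16):
        self.gamma, self.lo, self.hi = gamma, lo, hi

    def update(self, accepted, drafted):
        rate = accepted / max(drafted, 1)
        if rate > 0.8:            # drafts mostly accepted: speculate further
            self.gamma = min(self.hi, self.gamma + 1)
        elif rate < 0.4:          # frequent rejections: back off
            self.gamma = max(self.lo, self.gamma - 1)
        return self.gamma

ctl = AdaptiveGamma()
print(ctl.update(4, 4))  # -> 5 (all 4 drafted tokens accepted)
print(ctl.update(1, 5))  # -> 4 (only 1 of 5 accepted)
```

The fixed-length baseline SpecKV improves on corresponds to never calling `update` — the same gamma regardless of how the draft model is performing.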