PulseAugur

New techniques such as UniVer and SpecKV speed up LLM inference through speculative decoding

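Researchers have developed new methods for accelerating large language model (LLM) inference. UniVer offers a unified approach to multi-step and multi-draft speculative decoding, improving acceptance length by up to 8.5%. Speculative Speculative Decoding (SSD) runs verification and speculation in parallel; its optimized Saguaro algorithm achieves up to a 5x speedup over autoregressive decoding. SpecKV adds an adaptive controller that selects the speculation length dynamically from model-compression and draft-model signals, a 56.0% improvement over fixed-length speculation.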
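The draft-then-verify loop that all three papers build on can be sketched as follows. This is a minimal toy version under stated assumptions, not the algorithm of UniVer, SSD, or SpecKV: it uses greedy token matching instead of the probabilistic acceptance rule used in practice, the models are stand-in functions, and the names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical. The fixed `gamma` argument is the speculation length.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[Token], gamma: int,
                       max_new: int) -> List[Token]:
    """Toy greedy draft-then-verify loop with a fixed speculation length."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft: the cheap model proposes gamma tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # 2. Verify: keep the longest prefix the target model agrees with.
        #    (A real system scores all gamma positions in one batched pass;
        #    this sketch calls the target once per position for clarity.)
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The target always contributes one token: the correction at the
        #    first mismatch, or a bonus token if every draft was accepted.
        out.append(target(out))
    return out[:len(prompt) + max_new]


# Hypothetical stand-in models: the target counts upward modulo 10,
# and the draft agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else toy_target(ctx)
```

With greedy matching the output is identical to decoding with the target model alone; the draft model only changes how many target calls can be batched per step. An adaptive scheme in the spirit of SpecKV would replace the constant `gamma` with a value chosen per step.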

Impact: The new speculative decoding techniques are expected to significantly speed up LLM inference, lowering compute cost and latency.

Ranking rationale: Multiple arXiv papers introduce new techniques for accelerating LLM inference.

AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

  1. arXiv cs.LG TIER_1 English(EN) · Yepeng Weng, Qiao Hu, Takehisa Yairi

    UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

    arXiv:2605.04543v1 Announce Type: cross Abstract: Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolat…

  2. arXiv cs.CL TIER_1 English(EN) · Takehisa Yairi

    UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

    Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation, applying either flat OT to single-step drafts…

  3. arXiv cs.LG TIER_1 English(EN) · Tanishq Kumar, Tri Dao, Avner May

    Speculative Speculative Decoding

    arXiv:2603.03251v3 Announce Type: replace Abstract: Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then…

  4. arXiv cs.CL TIER_1 English(EN) · Shikhar Shukla

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    arXiv:2605.02888v1 Announce Type: cross Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation lengt…

  5. arXiv cs.CL TIER_1 English(EN) · Shikhar Shukla

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $γ$, which determines how many tokens the draft …

  6. arXiv cs.CL TIER_1 English(EN) · Shikhar Shukla

    SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $γ$, which determines how many tokens the draft …

  7. arXiv cs.LG TIER_1 English(EN) · Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao

    Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

    arXiv:2604.21952v1 Announce Type: new Abstract: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computatio…