PulseAugur

DFlash accelerates LLM inference by drafting token blocks in parallel

Researchers at UC San Diego have developed DFlash, a speculative decoding method for accelerating large language model inference. Unlike previous methods that draft tokens one at a time, DFlash proposes entire blocks of tokens in parallel using a lightweight block diffusion model, which the target model then verifies. The paper reports more than 6x lossless acceleration across a range of models and tasks, with gains over existing techniques such as EAGLE-3, and up to 15x higher throughput on NVIDIA Blackwell GPUs for GPT-OSS 120B compared with traditional autoregressive decoding.
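To make the draft-then-verify idea concrete, here is a minimal toy sketch of block-wise speculative decoding with greedy verification. It is not DFlash's implementation: `target_next` and `draft_block` are hypothetical stand-ins (a deterministic toy "model" and a drafter that is deliberately wrong at some positions), and the drafting loop only simulates what a block drafter would produce in one parallel pass. What it does show is why the scheme is lossless: every emitted token is checked against the target, so the output matches plain greedy decoding while needing fewer target passes.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target LLM's greedy next-token choice."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter.

    A real block drafter would propose all positions in one forward pass.
    This toy peeks at the target and injects errors at fixed positions,
    purely to simulate a drafter that is right most of the time.
    """
    block: List[Token] = []
    for i in range(block_size):
        tok = target_next(context + block)
        if (len(context) + i) % 4 == 3:
            tok = (tok + 1) % 50  # inject a wrong guess
        block.append(tok)
    return block


def greedy_generate(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def speculative_generate(
    prompt: List[Token], num_tokens: int, block_size: int = 4
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        draft = draft_block(out, block_size)
        # In a real system the target scores the whole block in one pass;
        # here that single pass is simulated position by position.
        target_passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction ends the block
                break
            accepted.append(tok)
        else:
            # Whole block accepted: the target pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + num_tokens], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_generate(prompt, 20)
    assert spec == greedy_generate(prompt, 20)  # identical output: lossless
    print(f"generated 20 tokens in {passes} simulated target passes")
```

The achievable speedup depends on how often the drafter's guesses are accepted and on how cheap the drafter is relative to the target, which is where a single-pass block drafter is claimed to help.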

IMPACT DFlash's parallel block drafting could significantly reduce LLM inference costs and latency, enabling more complex and interactive AI applications.

RANK_REASON Research paper introducing a new method for LLM inference acceleration.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. MarkTechPost TIER_1 English(EN) · Asif Razzaq ·

    DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

    UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless…

  2. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    UC San Diego researchers have developed DFlash, a block diffusion model that drafts whole token blocks in a single pass for speculative decoding. The technique delivers up to 15x higher throughput on NVIDIA Blackwell GPUs compared to traditional autoregressive methods. It works b…