DFlash accelerates LLM inference by drafting token blocks in parallel

By PulseAugur Editorial · [2 sources] · 2026-06-24 07:21

Researchers from UC San Diego have developed DFlash, a speculative decoding method that accelerates large language model inference. Unlike previous methods that draft tokens one by one, DFlash proposes entire blocks of tokens in parallel using a lightweight block diffusion model. The paper reportedly shows up to 6.08x lossless acceleration across various models and tasks, and up to 15x higher throughput on NVIDIA Blackwell GPUs for GPT-OSS 120B when compared to existing techniques like EAGLE-3.
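To make the draft-then-verify idea concrete, here is a minimal Python sketch of block speculative decoding with greedy verification. It is not DFlash's implementation: `target_next` and `draft_block` are hypothetical toy stand-ins for the target model and the block drafter, and the block diffusion drafter and KV injection are not modeled at all. It only illustrates why the output can be lossless (every kept token is one the target itself would have chosen) while needing fewer target passes than one-token-at-a-time decoding.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter.

    A real block drafter would propose all block_size tokens in one
    forward pass; this toy version only mimics the interface and
    deliberately makes a mistake at some positions.
    """
    ctx = list(context)
    proposal = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def greedy_decode(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token], num_tokens: int, block_size: int = 4
) -> Tuple[List[Token], int]:
    """Draft a block, then keep the longest prefix the target agrees with."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_block(out, block_size)
        # One verification pass: a real target model would score every
        # drafted position in parallel; the loop below emulates that.
        target_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction, then stop
                break
        else:
            # Whole block accepted: the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + num_tokens], target_passes
```

In this toy setup, `speculative_decode([1, 2, 3], 20)` returns the same tokens as `greedy_decode([1, 2, 3], 20)` while counting far fewer verification passes than generated tokens. The actual speedup of any real system depends on how often the drafter's block is accepted and on the relative cost of drafting and verification.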

IMPACT DFlash's parallel block drafting could significantly reduce LLM inference costs and latency, enabling more complex and interactive AI applications.

RANK_REASON Research paper introducing a new method for LLM inference acceleration.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-06-24 07:21

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless…

Mastodon - fosstodon.org TIER_1 English(EN) · [email protected] · 2026-06-24 07:55

UC San Diego researchers have developed DFlash, a block diffusion model that drafts whole token blocks in a single pass for speculative decoding. The technique delivers up to 15x higher throughput on NVIDIA Blackwell GPUs compared to traditional autoregressive methods. It works b…
