Researchers at UC San Diego have developed DFlash, a speculative decoding method that accelerates large language model inference. Unlike earlier methods that draft tokens one at a time, DFlash drafts an entire block of tokens in parallel using a lightweight block diffusion model. The approach reportedly achieves over 6x lossless acceleration across various models and tasks, and up to 15x higher throughput for GPT-OSS 120B on NVIDIA Blackwell GPUs, compared with existing techniques such as EAGLE-3.
AI IMPACT DFlash's parallel block drafting could reduce LLM inference cost and latency, enabling more complex and interactive AI applications.
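To make the draft-then-verify idea concrete, here is a minimal toy sketch of block speculative decoding with greedy verification. It is not DFlash's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the large target model and the lightweight block drafter, and the toy drafter uses a plain loop where a real block drafter would produce the whole block in parallel. The sketch only illustrates why such schemes can be called lossless: the target model checks every drafted token, so the output matches ordinary greedy decoding.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter.

    It usually agrees with the target but is deliberately wrong at every
    fifth position, so that verification has something to reject.
    """
    ctx = list(prefix)
    block = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_decode(prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token], num_new: int, block_size: int
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with.

    Returns the generated sequence and the number of verification passes.
    In a real system each pass scores the whole block in one forward pass.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_new:
        draft = draft_block(out, block_size)
        passes += 1
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            ctx.append(target_next(ctx))  # whole block accepted: bonus token
        out = ctx
    return out[: len(prompt) + num_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], num_new=20, block_size=4)
    print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

Each verification pass appends at least one token, so the loop never does worse than the baseline in pass count; the speedup in practice depends on how often the drafter's block is accepted.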
RANK_REASON Research paper introducing a new method for LLM inference acceleration.
- DiffuSpec
- FLASH
- GPT-OSS 120B
- NVIDIA
- NVIDIA Blackwell B200
- Qwen3-coder
- SpecDiff-2
- University of California, San Diego
AI-generated summary · Google Gemini · from 2 sources.