PulseAugur

Google's DiffusionGemma LLM Achieves 1000 Tokens/Sec with Diffusion Architecture

Google DeepMind has released DiffusionGemma, an open-weight LLM that uses a diffusion architecture for text generation, enabling significantly faster inference than traditional autoregressive models. The model can generate up to 1,000 tokens per second on a single H100 GPU and requires only 18 GB of VRAM, which makes it practical for single-GPU deployments. It trades some accuracy for speed, but it excels at tasks such as code infilling and real-time applications, and it also supports multimodal inputs, including images and video.

IMPACT Speeds up inference and reduces VRAM requirements, potentially enabling new real-time applications and wider single-GPU deployments.

RANK_REASON New open-weight diffusion-based LLM release from Google DeepMind.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to, LLM tag · TIER_1 · English (EN) · Sayed Ali Alkamel

    DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

    TL;DR: Google released DiffusionGemma, an open Apache 2.0 diffusion-based LLM that generates text up to 4x faster than autoregressive models, hitting 1,000+ tokens/sec on a single H100 and fitting in 18 GB VRAM. It trades some accuracy for speed. …