Brief · PulseAugur

SIGNIFICANT · dev.to — LLM tag English(EN) · 5h

DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

Google DeepMind has released DiffusionGemma, an open-weight LLM that generates text with a diffusion architecture, making inference significantly faster than with traditional autoregressive models. The model reaches up to 1,000 tokens per second on a single H100 GPU and requires only 18 GB of VRAM, so it fits single-GPU deployments. It trades some accuracy for speed but excels at tasks such as code infilling and real-time applications, and it also supports multimodal inputs, including images and video.

IMPACT Faster inference and lower VRAM requirements could enable new real-time applications and wider single-GPU deployments.
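
To make the "inference economics" angle concrete, here is a rough back-of-envelope sketch in Python. It only uses the quoted peak of 1,000 tokens per second; the hourly GPU price is a hypothetical placeholder, not a figure from the article, and sustained throughput in practice may be lower than the peak.

```python
# Back-of-envelope sketch. The throughput is the peak figure quoted in the
# brief; the hourly GPU price is a hypothetical input chosen for illustration.


def seconds_for_tokens(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds needed to emit num_tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def cost_per_million_tokens(tokens_per_second: float,
                            gpu_price_per_hour: float) -> float:
    """GPU cost of one million tokens on one GPU at the given throughput."""
    seconds = seconds_for_tokens(1_000_000, tokens_per_second)
    return seconds / 3600.0 * gpu_price_per_hour


if __name__ == "__main__":
    peak_tokens_per_second = 1_000.0   # peak figure quoted in the brief
    hypothetical_price_per_hour = 3.0  # assumption: placeholder, not a real quote

    minutes = seconds_for_tokens(1_000_000, peak_tokens_per_second) / 60.0
    cost = cost_per_million_tokens(peak_tokens_per_second,
                                   hypothetical_price_per_hour)
    print(f"1M tokens: about {minutes:.1f} minutes, about ${cost:.2f} of GPU time")
```

Swapping in a measured throughput and an actual hourly rate gives a comparable figure for any other model on the same hardware.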

Google DeepMind
Stable Diffusion
NVIDIA NIM
JAX
Hugging Face Transformers
vLLM
Gemma 4
DiffusionGemma
Google Cloud Vertex AI Model Garden