Google DeepMind has released DiffusionGemma, an open-weight LLM that uses a diffusion architecture for text generation, enabling significantly faster inference than traditional autoregressive models. The model can process up to 1,000 tokens per second on a single H100 GPU and requires only 18 GB of VRAM, which suits it to single-GPU deployments. It trades some accuracy for speed, but it excels at tasks such as code infilling and real-time applications, and it also supports multimodal inputs, including images and video.
AI IMPACT Accelerates inference and reduces VRAM requirements, potentially enabling new real-time applications and wider single-GPU deployment.
RANK_REASON New open-weight diffusion-based LLM release from Google DeepMind.
- DiffusionGemma
- Gemma 4
- Google Cloud Vertex AI Model Garden
- Google DeepMind
- Hugging Face Transformers
- JAX
- NVIDIA NIM
- Stable Diffusion
- vLLM
AI-generated summary · Google Gemini · from 1 source.