DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics
Google DeepMind has released DiffusionGemma, an open-weight LLM that generates text with a diffusion architecture, giving it significantly faster inference than traditional autoregressive models. The model can generate up to 1,000 tokens per second on a single H100 GPU and requires only 18 GB of VRAM, which suits it to single-GPU deployments. It trades some accuracy for speed, but it excels at tasks such as code infilling and real-time applications, and it also supports multimodal inputs, including images and video.
IMPACT: Faster inference and lower VRAM requirements could enable new real-time applications and wider single-GPU deployments.
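
To put the headline figures in perspective, here is a rough back-of-the-envelope sketch in Python. The 1,000 tokens/sec and 18 GB values come from the summary above; the 80 GB GPU memory size and the 100 tokens/sec comparison baseline are illustrative assumptions, not reported measurements, and the helper names are hypothetical.

```python
# Figures taken from the summary above.
CLAIMED_TOKENS_PER_SECOND = 1_000.0
CLAIMED_VRAM_GB = 18.0

# Assumptions for illustration only (not from the source).
ASSUMED_GPU_MEMORY_GB = 80.0           # assumes an 80 GB H100 variant
ASSUMED_BASELINE_TOKENS_PER_SEC = 100.0  # hypothetical autoregressive baseline


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady throughput, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def vram_fraction(model_gb: float, gpu_gb: float) -> float:
    """Share of GPU memory taken up by the model's stated VRAM requirement."""
    if gpu_gb <= 0:
        raise ValueError("gpu_gb must be positive")
    return model_gb / gpu_gb


if __name__ == "__main__":
    for n in (100, 500, 2_000):
        fast = generation_seconds(n, CLAIMED_TOKENS_PER_SECOND)
        slow = generation_seconds(n, ASSUMED_BASELINE_TOKENS_PER_SEC)
        print(f"{n:>5} tokens: {fast:.2f} s claimed vs {slow:.2f} s assumed baseline")

    share = vram_fraction(CLAIMED_VRAM_GB, ASSUMED_GPU_MEMORY_GB)
    print(f"VRAM share of an assumed 80 GB card: {share:.1%}")
```

Under these assumptions, a 500-token response would take about half a second of generation time, which is the kind of latency that makes the real-time use cases plausible. Actual throughput will vary with batch size, sequence length, and the number of denoising steps.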