DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

Google's DiffusionGemma LLM uses a diffusion architecture to reach 1,000 tokens per second

By the PulseAugur editorial team · [1 source] · 2026-06-12 18:30

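Google DeepMind has released DiffusionGemma, an open-source LLM that uses a diffusion architecture for text generation and delivers markedly faster inference than conventional autoregressive models. The model can process up to 1,000 tokens per second on a single H100 GPU and needs only 18 GB of VRAM, which makes efficient single-GPU deployment practical. It trades some accuracy for speed, but it performs well on tasks such as code infilling and real-time applications, and it also supports multimodal input, including images and video.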

Impact: Faster inference and lower VRAM requirements could enable new real-time applications and broader single-GPU deployment.

Ranking rationale: Google DeepMind released a new open-source diffusion-based LLM.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

dev.to (LLM tag) · English · Sayed Ali Alkamel · 2026-06-12 18:30

DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

TL;DR: Google released DiffusionGemma, an open Apache 2.0 diffusion-based LLM that generates text up to 4x faster than autoregressive models, hitting 1,000+ tokens/sec on a single H100 and fitting in 18 GB VRAM. It trades some accuracy for speed. …
