120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Gemma 4 12B model reaches 120 tokens/sec on 12GB VRAM

By the PulseAugur editorial team · [1 source] · 2026-06-06 18:53

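A user on Reddit's r/LocalLLaMA reached an inference speed of 120 tokens per second with Google's Gemma 4 12B model. The result used the model's Quantization-Aware Training (QAT) variant in GGUF format, running on a system with 12GB of VRAM. The setup relied on a patched build of llama.cpp and specific model files, and it shows that a large language model can run efficiently on local consumer-grade hardware.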
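As a rough illustration of why quantization matters for fitting a 12B model into 12GB of VRAM, the sketch below estimates the size of the weights at a few precisions. The bits-per-weight values, the 1.5 GiB headroom for KV cache and buffers, and the exact parameter count are assumptions made for illustration only; none of them come from the original post.

```python
# Back-of-the-envelope VRAM estimate. All constants below are illustrative
# assumptions, not figures reported in the Reddit post.

GIB = 2**30


def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits_in_vram(n_params: float, bits_per_weight: float,
                 vram_gib: float = 12.0, headroom_gib: float = 1.5) -> bool:
    """True if weights plus an assumed headroom (KV cache, buffers) fit in VRAM."""
    return weight_footprint_gib(n_params, bits_per_weight) + headroom_gib <= vram_gib


if __name__ == "__main__":
    n_params = 12e9  # "12B" taken at face value (assumption)
    # 16 and 8 bits are shown for contrast; 4.5 is an assumed average
    # for a 4-bit-class quantized file.
    for bits in (16.0, 8.0, 4.5):
        size = weight_footprint_gib(n_params, bits)
        verdict = "fits" if fits_in_vram(n_params, bits) else "does not fit"
        print(f"{bits:>4} bits/weight: {size:5.1f} GiB of weights, {verdict}")
```

Under these assumptions only the 4-bit-class file leaves room for the KV cache on a 12GB card, which is consistent with the post's remark that the model fits entirely in VRAM.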

Impact: Demonstrates efficient local LLM inference on consumer-grade hardware, which may lower the barrier to entry for developers.

Ranking rationale: A user-run benchmark and optimization of an existing model release.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA · English · /u/janvitos · 2026-06-06 18:53

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! …

