PulseAugur

Gemma 4 12B model reaches 120 tokens/sec on 12GB VRAM

A user on Reddit's r/LocalLLaMA subreddit reports reaching an inference speed of 120 tokens per second with Google's Gemma 4 12B model. The run used the Quantization-Aware Training (QAT) variant of the model in GGUF format on a GPU with 12GB of VRAM, which the model fits into entirely. The setup involved a patched version of llama.cpp and specific model files, demonstrating efficient local execution of a large language model on consumer hardware.
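As a rough sanity check on the claim that a 12B model fits in 12GB of VRAM, the weight footprint can be estimated from parameter count and bits per weight. The sketch below is illustrative only: the source does not state the quantization width, so the bit widths tried here, the 2 GiB allowance for KV cache and runtime buffers, and the helper names `weight_footprint_gib` and `fits_in_vram` are all assumptions, and "12GB" is treated as 12 GiB.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_in_vram(n_params: float, bits_per_weight: float,
                 vram_gib: float, overhead_gib: float) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    return weight_footprint_gib(n_params, bits_per_weight) + overhead_gib <= vram_gib


N_PARAMS = 12e9      # "12B" from the headline
VRAM_GIB = 12.0      # "12GB" from the headline, treated as GiB (assumption)
OVERHEAD_GIB = 2.0   # assumed allowance for KV cache and buffers, not from the source

# Bit widths are hypothetical: 16 (unquantized), 8, and roughly 4.5 (a typical 4-bit GGUF average).
for bits in (16, 8, 4.5):
    size = weight_footprint_gib(N_PARAMS, bits)
    ok = fits_in_vram(N_PARAMS, bits, VRAM_GIB, OVERHEAD_GIB)
    print(f"{bits:>4} bits/weight: {size:5.2f} GiB of weights, fits in 12 GiB: {ok}")

# Per-token latency implied by the reported throughput of 120 tokens/sec.
ms_per_token = 1000 / 120
print(f"120 tok/s is about {ms_per_token:.2f} ms per token")
```

Under these assumptions only the roughly 4-bit case leaves headroom on a 12GB card, which is consistent with the post's use of a quantized GGUF file, though the actual quantization level is not given in the summary.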

IMPACT Demonstrates efficient local LLM inference on consumer hardware, potentially lowering barriers for developers.

RANK_REASON User-driven benchmark and optimization of an existing model release.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/janvitos

    120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

    Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result!…