Brief · PulseAugur

TOOL · r/LocalLLaMA · English (EN) · 4h

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

A user on r/LocalLLaMA reports reaching 120 tokens per second of inference throughput with Google's Gemma 4 12B model. The run used a Quantization-Aware Training (QAT) variant of the model in GGUF format (labeled "QAT MTP" in the post title) on a system with 12GB of VRAM. The setup relied on a patched version of llama.cpp and specific model files, and it shows a 12B-parameter model running at interactive speed on consumer hardware.
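To put the reported figures in perspective, the short Python sketch below converts throughput into per-token latency and estimates the size of the quantized weights. It is a back-of-the-envelope illustration, not the poster's method: the function names are made up for this example, the 12 billion parameter count is taken from the model's name, and the 4.5 bits per weight figure assumes a Q4_0-style GGUF quantization (32 weights per block, 4 bits each plus one 16-bit scale), which the summary does not confirm.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Ignores the KV cache, activations, and runtime buffers.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # 120 tok/s is the throughput reported in the post.
    latency = ms_per_token(120)

    # ASSUMPTIONS: 12e9 parameters (from the "12B" name) and
    # 4.5 bits per weight (Q4_0-style blocks); neither is confirmed.
    weights = weight_footprint_gib(12e9, 4.5)

    print(f"per-token latency: {latency:.2f} ms")
    print(f"estimated weights: {weights:.2f} GiB")
```

Under these assumptions the weights alone would take a bit over 6 GiB, leaving the remainder of a 12GB card for the KV cache and runtime buffers. Actual usage depends on the quantization type and context length.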

IMPACT Demonstrates fast local LLM inference on a 12GB consumer GPU, potentially lowering the hardware barrier for developers who want to run models locally.

Google
llama.cpp
Unsloth
Gemma 4 12B
RTX 4070 Super