Gemma 4 QAT MTP heads released, crash fix enables parallel processing

By PulseAugur Editorial · [1 source] · 2026-06-06 21:41

The Gemma 4 QAT MTP assistant heads have been released on HuggingFace for use as draft heads in speculative decoding. The heads are trained to match the quantized (QAT) weights of the Gemma 4 models, which is reported to raise draft-token acceptance rates compared with heads that are not QAT-matched. The same update fixes a crash in the llama.cpp implementation that occurred when running with two parallel slots (PARALLEL=2), improving stability for local LLM inference, and includes a 12B two-slot benchmark on Strix Halo with Vulkan.
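
To illustrate why acceptance rate matters for speculative decoding, here is a minimal sketch rather than anything taken from the post or from llama.cpp. It assumes each drafted token is accepted independently with the same probability alpha, a common simplification in the speculative decoding literature. The function name and the example values are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumption (not from the source): each of the k drafted tokens is
    accepted independently with probability alpha. Drafted tokens are
    accepted up to the first rejection, and the target model always
    contributes one more token, so the expectation is
    1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(k + 1)
    # Closed form of the geometric sum above.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates only; these are not measured values.
    for alpha in (0.5, 0.6, 0.8):
        print(alpha, expected_tokens_per_step(alpha, k=4))
```

Under this model, a head with a higher acceptance rate yields more tokens per verification pass for the same draft length, which is the mechanism by which QAT-matched heads would speed up decoding.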

IMPACT Enables more efficient local inference for Gemma 4 models by providing QAT-matched draft heads and fixing a crash that occurred with two parallel slots.

RANK_REASON Release of model components and a bug fix for local LLM inference software.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/westsunset · 2026-06-06 21:41

QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench

Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan)

Three things in one update: the converted QAT-matched draft heads are now uploa…

