QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench
The Gemma 4 QAT MTP assistant heads have been released on Hugging Face for use in speculative decoding. The heads are trained to match the quantized weights of the Gemma 4 models, which significantly increases acceptance rates compared with heads that are not matched to the QAT weights. In addition, a crash in the llama.cpp implementation when running with two parallel slots (PARALLEL=2) has been identified and fixed, improving stability for local LLM inference.
AI IMPACT Enables more efficient local inference for Gemma 4 models by providing QAT-matched MTP heads and fixing a crash in two-slot operation.