The Gemma 4 QAT MTP assistant heads have been released on Hugging Face, offering improved performance for speculative decoding. The heads are trained specifically to match the quantized weights of the Gemma 4 models, which significantly increases acceptance rates compared with heads that are not QAT-matched. Separately, a critical crash bug in the llama.cpp implementation, triggered when using two parallel processing threads, has been identified and fixed, improving stability for local LLM inference.
AI IMPACT Enables more efficient local inference for Gemma 4 models by providing QAT-matched assistant heads and fixing a critical crash bug.
RANK_REASON Release of model components and a bug fix for local LLM inference software.
AI-generated summary · Google Gemini · from 1 source.