PulseAugur

User seeks fix for Gemma 4 31B model repeating tokens

A user on the r/LocalLLaMA subreddit is asking for help running the Gemma 4 31B QAT GGUF model. Although both the main model and an MTP assistant head load successfully, the model outputs repeated <unused49> tokens instead of coherent text. The user has tried different model files, local compatibility fixes, and command-line arguments, but has not found a working configuration.

IMPACT Troubleshooting a specific model configuration may help other users facing similar issues with local LLM deployments.

RANK_REASON User-generated technical support request for a specific model version and format.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/WaveformEntropy

    Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe?

    I’m trying to run:

    unsloth/gemma-4-31B-it-qat-GGUF

    gemma-4-31B-it-qat-UD-Q4_K_XL.gguf

    on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch.

    Main model loads. Without the MTP assistant head, /v1/chat/completions re…