Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas?
A user on the r/LocalLLaMA subreddit is asking for advice on speeding up token generation with Google's Gemma 4 model. With the QAT + MTP setup named in the title, they see at most a 33% speed increase, reaching 100 tokens per second, and want to know how to improve on that. The post details their hardware, including dual RTX 3060 Ti GPUs, and the specific command-line parameters they use with llama.cpp.
IMPACT Readers running local LLMs can pick up performance improvements and tuning strategies from the discussion.