Local LLM Speed Boosted by Gemma 4 MTP and QAT

By PulseAugur Editorial · [2 sources] · 2026-06-08 15:04

A recent update to the "Run LLMs Locally" project adds Multi-Token-Prediction (MTP) for the Gemma-4-E4B and Gemma-4-26B models from Unsloth. According to the Mastodon post announcing it, MTP raises token generation speed by a further 25-90% on top of the roughly 50% gain already obtained from Quantization-Aware Training (QAT). A user on r/LocalLLaMA reports a smaller gain with QAT + MTP, peaking at about 33% (from around 75 t/s to 100 t/s on 2x RTX 3060 Ti 8GB). Additionally, the OpenCode configuration was adjusted to reduce prompt sizes by 60%, and logging of all prompts has been implemented.

IMPACT These optimizations for local LLM execution could lower the barrier to entry for advanced AI applications, enabling more users to run powerful models on consumer hardware.

RANK_REASON The cluster discusses optimizations and performance improvements for running existing LLM models locally, which falls under research and development in AI.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

Mastodon — sigmoid.social TIER_1 English(EN) · [email protected] · 2026-06-10 18:28

Same week, small update: Run LLMs Locally Multi-Token-Prediction (MTP) for Gemma-4-E4B and Gemma-4-26B from Unsloth. After 50% from QAT, this brings another 25-90% improvement in token generation speed. The OpenCode config slide received a small update to reduce prompt sizes with…

LINKS codeberg.org/…/Run_LLMs_Locally_2026_Thom…

r/LocalLLaMA TIER_1 English(EN) · /u/Ready_Performance_35 · 2026-06-08 15:04

Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas?

Hello, my setup is 2x RTX 3060 Ti 8GB. Without the assistant model (MTP) I get around 75 t/s; adding the assistant model as draft, I manage to reach a 100 t/s peak. I tried putting the model on a single card with minimal context si…
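
The percentages quoted in the two posts can be checked with a little arithmetic. The sketch below is illustrative only: it assumes each percentage is a relative gain in tokens per second, and that the QAT and MTP gains multiply when stacked, which neither post states explicitly. The function names are hypothetical helpers, not part of any tool mentioned here.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain, in percent, going from baseline_tps to new_tps."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def compound_percent(*gains_percent: float) -> float:
    """Combined gain in percent, ASSUMING the individual gains multiply."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0


# Reddit report: about 75 t/s without the draft model, 100 t/s peak with it.
reddit_gain = speedup_percent(75.0, 100.0)

# Mastodon report: about 50% from QAT, then another 25-90% from MTP.
combined_low = compound_percent(50.0, 25.0)
combined_high = compound_percent(50.0, 90.0)

print(f"Reddit MTP gain: {reddit_gain:.1f}%")
print(f"QAT + MTP combined: {combined_low:.1f}% to {combined_high:.1f}%")
```

Under these assumptions, the Reddit figure of 33% sits inside the 25-90% range the Mastodon post gives for MTP alone, near its lower end.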
