PulseAugur

Gemma 4 E4B achieves 2.4x speedup with LiteRT engine

A user reports a roughly 2.4x speedup in text generation with Google's Gemma 4 E4B model by running it on the LiteRT engine with multi-token prediction (MTP), measured against the Q4 GGUF quantization in llama.cpp. For image captioning the gain was marginal (about 1.1x), because the vision encoder, not the text decoder, was the bottleneck. The user also wrote a Python wrapper that exposes the faster local model through an OpenAI-compatible endpoint and integrated it into their workflow.
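The gap between the 2.4x and 1.1x figures is what Amdahl's law predicts when only part of a pipeline gets faster. The sketch below is illustrative only: it assumes the 2.4x figure applies to the text-decoding share of a captioning request and that the vision encoder's time is unchanged. The source gives no timing breakdown, so the decode fraction it derives is an inference under those assumptions, not a measurement. The function names are hypothetical.

```python
def overall_speedup(decode_fraction: float, decode_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the decode share is accelerated.

    decode_fraction: share of baseline wall time spent in text decoding (0..1).
    decode_speedup:  factor by which that share gets faster (e.g. 2.4).
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0.0:
        raise ValueError("decode_speedup must be positive")
    # Time left over = untouched part + accelerated part, relative to baseline.
    remaining = (1.0 - decode_fraction) + decode_fraction / decode_speedup
    return 1.0 / remaining


def implied_decode_fraction(overall: float, decode_speedup: float) -> float:
    """Invert Amdahl's law: which decode share explains an observed overall speedup?"""
    if overall <= 0.0 or decode_speedup <= 0.0 or decode_speedup == 1.0:
        raise ValueError("speedups must be positive and decode_speedup must not be 1")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / decode_speedup)


if __name__ == "__main__":
    # Figures from the summary: 2.4x on text decoding, about 1.1x on captioning.
    share = implied_decode_fraction(overall=1.1, decode_speedup=2.4)
    print(f"Implied decode share of baseline captioning time: {share:.1%}")
    print(f"Check, overall speedup at that share: {overall_speedup(share, 2.4):.2f}x")
```

Under these assumptions, text decoding would account for roughly 16% of the baseline captioning time, which is consistent with the vision encoder dominating the request.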

IMPACT Demonstrates significant local inference speedups for open-source models, potentially lowering barriers to advanced AI use.

RANK_REASON User-driven performance optimization and benchmark of an existing model.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English (EN) · /u/AnticitizenPrime

    Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same

    I know there is a PR in llama.cpp to support MTP for the 26b and 31b versions of Gemma 4, but as far as I can tell there is nothing yet for the E2B and E4B models.

    Using Hermes Agent, I had it set up Gemma 4 E4B in Google's Lite RT format,…