PulseAugur

llama.cpp flag boosts Qwen 35B model speed by 2.8x on RTX 4070

A technical guide reports a 2.8x speedup when running the Qwen3.5-35B-A3B model on an RTX 4070 GPU with 12 GB of VRAM: throughput rose from 12.2 tok/s with Ollama defaults to 34.6 tok/s with `llama.cpp`. The gain comes from two flags: `-ngl 99`, which offloads all model layers to the GPU, and `--cpu-moe`, which keeps the Mixture of Experts (MoE) expert weights on the CPU. The combination suits MoE models because only a fraction of the experts are active for each token, so holding every expert in limited VRAM is wasteful, while the attention and other dense weights, which are used for every token, benefit most from the GPU. The guide also includes a sweep of offload configurations to help users choose settings for different VRAM tiers.
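As a rough illustration of the flag combination described above, the following Python sketch assembles a `llama-server` command line and a small offload sweep. It is not the guide's own script. The model path and the sweep values are hypothetical. The `--n-cpu-moe N` flag (keep only the expert weights of the first N layers on the CPU) is an assumption about how such a sweep could be expressed; confirm it against `llama-server --help` for your build. The `speedup` helper simply divides the two throughput figures quoted in the source excerpt.

```python
from typing import List, Optional


def build_llama_cmd(model_path: str,
                    n_gpu_layers: int = 99,
                    cpu_moe: bool = True,
                    n_cpu_moe: Optional[int] = None) -> List[str]:
    """Return an argv list for llama-server with the offload flags set."""
    cmd = ["llama-server", "-m", model_path, "-ngl", str(n_gpu_layers)]
    if n_cpu_moe is not None:
        # Assumption: partial expert offload, first N layers' experts on CPU.
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]
    elif cpu_moe:
        # Keep all MoE expert weights on the CPU.
        cmd.append("--cpu-moe")
    return cmd


def speedup(baseline_tps: float, tuned_tps: float) -> float:
    """Ratio of tuned throughput to baseline throughput (tokens per second)."""
    return tuned_tps / baseline_tps


if __name__ == "__main__":
    model = "models/qwen3.5-35b-a3b.gguf"  # hypothetical path
    print(" ".join(build_llama_cmd(model)))
    for n in (40, 30, 20, 10):  # hypothetical sweep values, not from the guide
        print(" ".join(build_llama_cmd(model, n_cpu_moe=n)))
    print(f"{speedup(12.2, 34.6):.1f}x")
```

Returning an argv list rather than a shell string means the result can be passed directly to `subprocess.run` without quoting problems if the model path contains spaces.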

IMPACT Speeds up LLM inference on consumer hardware, making larger models more accessible.

RANK_REASON Technical guide on optimizing a specific model's performance using particular software flags.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. dev.to — LLM tag TIER_1 English(EN) · Ken Imoto ·

    RTX 4070 + Qwen 35B: 2.8x Speedup From One llama.cpp Flag (--cpu-moe)

    The Ollama defaults gave me 12.2 tok/s on Qwen3.5-35B-A3B against an RTX 4070 (12 GB). I switched to `llama.cpp` with two flags and got 34.6 tok/s. 2.8x.

    The two flags were `-ngl 99` (offload all layers to GPU) and…