PulseAugur
EN
LIVE 02:27:29

LLaMA user sees doubled inference speed with Qwen model after CPU parameter change

A user on Reddit's r/LocalLLaMA subreddit is seeking assistance understanding unexpected performance gains when running the Qwen3.6-35B-A3B-UD-Q4_K_XL model. They observed a doubling of inference speed, from 17 to 34 tokens/second, after increasing the `--n-cpu-moe` parameter from 8 to 30, which contradicts their expectation of a performance decrease due to increased CPU load. The user is also inquiring about further optimizations for their setup, which includes 12GB VRAM and 32GB RAM, utilizing llama.cpp with the TurboQuant variant. AI

RANK_REASON This is a user-generated question on a specific technical configuration, not a general industry announcement or development.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/MackTuesday ·

    Could someone please help explain these results?

    <!-- SC_OFF --><div class="md"><p>I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to 34 tok/s). Shouldn't it have slowed down from t…