PulseAugur

r/LocalLLaMA user sees doubled inference speed with Qwen model after raising `--n-cpu-moe`

A user on Reddit's r/LocalLLaMA is asking for help understanding an unexpected performance gain when running the Qwen3.6-35B-A3B-UD-Q4_K_XL model. Their inference speed doubled, from 17 to 34 tokens/second, after they raised the `--n-cpu-moe` parameter from 8 to 30. They had expected the opposite, a slowdown from the increased CPU load. The user is also asking about further optimizations for their setup, which has 12 GB of VRAM and 32 GB of RAM and runs the TurboQuant variant of llama.cpp.

Summary written by gemini-2.5-flash-lite from 1 source.

RANK_REASON This is a user-generated question on a specific technical configuration, not a general industry announcement or development.


COVERAGE [1]

  1. r/LocalLLaMA TIER_1 · /u/MackTuesday

    Could someone please help explain these results?

    I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to 34 tok/s). Shouldn't it have slowed down from t…
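
One possible reading of the result, which the source does not confirm, is a memory-fit effect. In llama.cpp, `--n-cpu-moe N` keeps the MoE expert weights of the first N layers in system RAM, so a larger N leaves less on the GPU. If the GPU-resident part does not fit in 12 GB at N=8 but does at N=30, the higher setting could be faster despite the extra CPU work. The Python sketch below is back-of-envelope arithmetic only. Every constant in it (model size, expert share, layer count, overhead) is an assumed placeholder, not a measured property of this model, and the function names are hypothetical.

```python
# Rough VRAM-fit estimate for different --n-cpu-moe values.
# All constants are ASSUMPTIONS for illustration, not measured values.

MODEL_GIB = 20.0        # assumed total size of the quantized weights
EXPERT_FRACTION = 0.9   # assumed share of the weights held in MoE expert tensors
N_LAYERS = 40           # assumed number of transformer layers
VRAM_GIB = 12.0         # from the post: 12 GB of VRAM
OVERHEAD_GIB = 2.0      # assumed KV cache and compute buffers on the GPU


def gpu_resident_gib(total_gib, expert_fraction, n_layers, n_cpu_moe):
    """Estimate the weights (GiB) left on the GPU when the expert tensors
    of the first n_cpu_moe layers are kept in system RAM."""
    n_cpu_moe = max(0, min(n_cpu_moe, n_layers))  # clamp to a valid layer count
    per_layer_expert_gib = total_gib * expert_fraction / n_layers
    return total_gib - per_layer_expert_gib * n_cpu_moe


def fits_in_vram(n_cpu_moe):
    """True if the estimated GPU-resident weights plus overhead fit in VRAM."""
    resident = gpu_resident_gib(MODEL_GIB, EXPERT_FRACTION, N_LAYERS, n_cpu_moe)
    return resident + OVERHEAD_GIB <= VRAM_GIB


def smallest_fitting_n_cpu_moe():
    """Smallest n_cpu_moe that fits under the assumptions above, or None."""
    for n in range(N_LAYERS + 1):
        if fits_in_vram(n):
            return n
    return None


if __name__ == "__main__":
    for n in (8, 30):
        resident = gpu_resident_gib(MODEL_GIB, EXPERT_FRACTION, N_LAYERS, n)
        print(f"n_cpu_moe={n}: ~{resident:.1f} GiB on GPU, fits={fits_in_vram(n)}")
```

With these placeholder numbers, N=8 overshoots the 12 GB budget while N=30 fits with room to spare. That matches the direction of the reported change, but it is not evidence of the actual cause. Replacing the constants with values read from the real GGUF file and the llama.cpp load log would make the estimate meaningful for a specific setup.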