PulseAugur

Qwen 3.6 27B model performance drops with speculative decoding params

A user on the r/LocalLLaMA subreddit reports a significant drop in inference speed and GPU utilization when running the Qwen 3.6 27B model with speculative decoding parameters. When parameters such as `--spec-type draft-mtp` and `--spec-draft-n-max` are included, throughput falls from 70 tokens/second to 30 tokens/second, and GPU power draw decreases substantially. The user suspects that a recent llama.cpp update is the cause, since performance was previously much higher.
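To put the reported figures in perspective, the short sketch below converts the two throughput numbers into a relative drop and a per-token latency. The function name `throughput_change` is hypothetical and used only for illustration; the 70 and 30 tokens/second inputs are the values quoted in the summary, not new measurements.

```python
def throughput_change(before_tps: float, after_tps: float) -> dict:
    """Summarize the change between two throughput readings (tokens/second)."""
    if before_tps <= 0 or after_tps <= 0:
        raise ValueError("throughput values must be positive")
    ratio = after_tps / before_tps
    return {
        "ratio": ratio,                            # fraction of original speed retained
        "percent_drop": (1 - ratio) * 100,         # relative slowdown in percent
        "ms_per_token_before": 1000 / before_tps,  # average latency per token, before
        "ms_per_token_after": 1000 / after_tps,    # average latency per token, after
    }


if __name__ == "__main__":
    # Figures quoted in the summary: 70 t/s without the flags, 30 t/s with them.
    for name, value in throughput_change(70.0, 30.0).items():
        print(f"{name}: {value:.2f}")
```

With these inputs the run with the speculative decoding flags keeps a little under half of the original speed, which is the opposite of what speculative decoding is meant to achieve.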

IMPACT Potential performance regressions in open-source LLM inference engines can impact local deployment efficiency.

RANK_REASON User-reported performance issue with an open-source model and inference engine.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/BitGreen1270 ·

    Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization

    I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s:

    ./llama-server \
      -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8080 …