Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization

Qwen 3.6 27B slows down when speculative decoding parameters are enabled

By the PulseAugur editorial team · [1 source] · 2026-06-06 10:05

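A user in Reddit's r/LocalLLaMA community reports that, when running the Qwen 3.6 27B model, certain speculative decoding parameters cause inference speed and GPU utilization to drop significantly. With flags such as `--spec-type draft-mtp` and `--spec-draft-n-max` included, throughput fell from 70 tokens per second to 30 tokens per second, and GPU power draw dropped sharply as well. The user suspects a recent llama.cpp update is responsible, since performance was previously much higher.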
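To put the reported figures in proportion, the short sketch below computes the relative throughput drop and the unused share of the GPU power budget. The throughput numbers (70 and 30 tokens per second) come from the summary above, and the power numbers (a 475 W cap, roughly 300 W observed) come from the quoted post; the helper `relative_drop` and the variable names are illustrative assumptions, not part of llama.cpp or the original post.

```python
def relative_drop(before: float, after: float) -> float:
    """Return the fractional decrease going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures as reported by the user (approximate, not measured here).
baseline_tps = 70.0       # tokens/s without the speculative decoding flags
spec_tps = 30.0           # tokens/s with the flags enabled
power_cap_w = 475.0       # configured power limit of the GPU
observed_power_w = 300.0  # "barely hits 300W" with the flags enabled

tps_drop = relative_drop(baseline_tps, spec_tps)
slowdown = baseline_tps / spec_tps
unused_power = relative_drop(power_cap_w, observed_power_w)

print(f"Throughput drop: {tps_drop:.0%}")
print(f"Slowdown factor: {slowdown:.2f}x")
print(f"Unused power budget: {unused_power:.0%}")
```

On these reported numbers the drop works out to a bit over half of the original throughput, with more than a third of the power budget left unused, which is consistent with the GPU sitting partly idle rather than being compute-bound.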

Impact: A potential performance regression in an open-source LLM inference engine would reduce the efficiency of local deployments.

Ranking rationale: A user-reported performance issue involving an open-source model and inference engine.



AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

r/LocalLLaMA (English) · /u/BitGreen1270 · 2026-06-06 10:05

Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization

I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s:

`./llama-server -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf --host 0.0.0.0 --port 8080 …`
