PulseAugur

User seeks help optimizing MTP in llama.cpp server

A Reddit user is asking for help getting the "draft-mtp" (Multi-Token Prediction) feature working in the llama.cpp server. They downloaded the IQ4_NL quantization of unsloth/Qwen3.6-35B-A3B-MTP-GGUF and are running it with the MTP flag enabled. Their initial benchmarks show lower token generation speed with MTP active than without it, and they ask what might cause this and how to improve the draft acceptance rate.
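As a rough illustration of why drafting can lower throughput when the acceptance rate is low, the sketch below uses the standard speculative-decoding estimate. It is not taken from the post or from llama.cpp: the function names, the independent per-token acceptance probability `alpha`, the draft length `gamma`, and the relative draft cost `draft_cost` are all assumptions made for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the pass always yields at least one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted: gamma drafted tokens plus one from verification.
        return float(gamma + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    Assumption: one draft step costs `draft_cost` times one main-model step,
    and verifying gamma + 1 tokens costs the same as one ordinary decode step.
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the post.
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, round(expected_speedup(alpha, gamma=3, draft_cost=0.2), 3))
```

Under these assumptions, a value below 1.0 means drafting is slower than plain decoding. The estimate is optimistic where verifying several tokens at once costs more than a single decode step, which may be the case on real hardware.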

IMPACT Troubleshooting a specific feature in an open-source LLM inference tool, with potential performance improvements for users.

RANK_REASON User-generated content discussing the implementation and performance of a specific feature within an open-source tool.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 (AF) · /u/Ok_Warning2146

    How do I make MTP work in llama-server?

    Downloaded IQ4_NL gguf from unsloth/Qwen3.6-35B-A3B-MTP-GGUF.

    git cloned a recent llama.cpp (version: 9397 (ac4b5a3fd)) and compiled it with GGML_CUDA=ON to run on my single 3090

    llama-server command without MTP:
    ./build/bin/…