User seeks help optimizing MTP in llama.cpp server

By PulseAugur Editorial · [1 source] · 2026-05-29 07:41

A Reddit user is asking for help getting the "draft-mtp" (Multi-Token Prediction) feature to work in the llama.cpp server. They downloaded an MTP-enabled model, Qwen3.6-35B-A3B-MTP-GGUF, and are running it with the MTP flag enabled. Their initial benchmarks show token generation slowing down when MTP is active, and they ask what might cause the slowdown and how to improve the draft acceptance rate.
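Why a low draft acceptance rate can make generation slower rather than faster can be illustrated with the standard back-of-the-envelope model for speculative decoding. The sketch below is not taken from the Reddit post or from llama.cpp; the function names, the per-token acceptance probability `alpha`, the draft length `gamma`, and the relative draft cost `draft_cost` are all illustrative assumptions. It also assumes each drafted token is accepted independently and that verifying a batch of drafted tokens costs about the same as one normal decoding step, which may not hold for mixture-of-experts models.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    alpha: assumed probability that a single drafted token is accepted.
    gamma: assumed number of tokens drafted before each verification pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the verification pass.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated throughput relative to plain decoding (1.0 means no change).

    draft_cost: assumed cost of drafting one token, as a fraction of the
    cost of one main-model decoding step.
    """
    step_cost = gamma * draft_cost + 1.0  # gamma drafts + one verification
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Hypothetical values only; they are not measurements from the post.
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, round(estimated_speedup(alpha, gamma=3, draft_cost=0.2), 2))
```

Under these assumptions, a result below 1.0 means the drafting overhead outweighs the tokens gained, which is the kind of slowdown the user describes.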

IMPACT Troubleshooting a specific feature in an open-source LLM inference tool, with potential performance improvements for users.

RANK_REASON User-generated content discussing the implementation and performance of a specific feature within an open-source tool.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 (AF) · /u/Ok_Warning2146 · 2026-05-29 07:41

How do I make MTP work in llama-server?

Downloaded IQ4_NL gguf from unsloth/Qwen3.6-35B-A3B-MTP-GGUF. git cloned a recent llama.cpp (version: 9397 (ac4b5a3fd)) and compiled it with GGML_CUDA=ON to run on my single 3090. llama-server command without MTP: ./build/bin/…
