Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?
A user on the r/LocalLLaMA subreddit wants to improve the performance of the Qwen3.6-MTP-27B model running on a Tesla V100 GPU with llama.cpp. They currently get approximately 44-55 tokens per second and are looking for configuration changes that raise throughput without compromising output quality. The post details their current command-line arguments and hardware specifications, and asks three specific questions: whether any of the flags are suboptimal, whether the MTP settings can be tuned further, and how much a large context size affects generation speed.
AI IMPACT: Users running LLMs locally are looking to maximize inference speed, and the answers to questions like these could inform best practices for efficient model serving.