PulseAugur

Qwen3.6-27B model achieves 80 TPS with 218k context on single RTX 5090

A user on Reddit's r/LocalLLaMA community reports running the Qwen3.6-27B model at approximately 80 tokens per second with a 218,000-token context window on a single RTX 5090 graphics card. The setup uses an NVFP4-quantized checkpoint with MTP (multi-token prediction), served by the vLLM 0.19 inference server. It builds on previous experiments with the Qwen3.5-27B model.
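For a sense of scale, the reported rate can be turned into wall-clock time with simple arithmetic. The following is a minimal sketch, assuming the ~80 tokens per second figure is a steady decode rate (the summary does not specify how it was measured); the function names `generation_time_s` and `measured_tps` are illustrative and not part of vLLM or the original post.

```python
# Rough throughput arithmetic for a reported decode rate.
# Assumption: the rate is constant during generation; real rates vary
# with prompt length, batch size, and speculative-decoding acceptance.

REPORTED_TPS = 80.0  # approximate figure quoted in the post title


def generation_time_s(n_tokens: int, tps: float) -> float:
    """Seconds needed to generate n_tokens at a steady rate of tps tokens/s."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return n_tokens / tps


def measured_tps(n_tokens: int, elapsed_s: float) -> float:
    """Tokens per second observed for n_tokens generated in elapsed_s seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


if __name__ == "__main__":
    # A 1,000-token reply at the reported rate.
    print(generation_time_s(1000, REPORTED_TPS))  # 12.5
```

The same `measured_tps` helper can be used to check a local run against the reported number by timing a single completion and counting its output tokens.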

IMPACT Demonstrates efficient local deployment of large context models, potentially lowering barriers for advanced LLM use on consumer hardware.

RANK_REASON Release of a specific model version with performance metrics shared by a community member.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Kindly-Cantaloupe978 ·

    Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

    Qwen3.6-27B has been out for a few days, and the NVFP4 build with MTP was released earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

    Can follow the…