PulseAugur

Qwen3.6-27B model achieves 80 TPS with 218k context on single RTX 5090

A user on Reddit's r/LocalLLaMA community has shared details of running the Qwen3.6-27B model at high throughput on consumer hardware. Using an NVFP4-quantized checkpoint with MTP (multi-token prediction) and the vLLM 0.19 inference server, they reported approximately 80 tokens per second with a 218,000-token context window on a single RTX 5090 graphics card. The setup builds on the user's earlier experiments with the Qwen3.5-27B model.
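To see why a 27B model can fit on one card at all, a rough weight-memory estimate helps. The sketch below is a back-of-the-envelope calculation, not a measurement from the Reddit post. It assumes exactly 27 billion parameters, the RTX 5090's 32 GB of memory, and that every weight is stored in NVFP4 at about 4.5 bits (4-bit values plus one 8-bit scale per 16-element block). Real checkpoints often keep some layers in higher precision, so the true footprint is likely somewhat larger.

```python
# Back-of-the-envelope weight-memory estimate; all inputs are assumptions,
# not figures reported in the original post.

PARAMS = 27e9           # parameter count implied by the "27B" model name
GPU_MEMORY_GB = 32.0    # RTX 5090 memory capacity


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Assumption: 4-bit values plus one 8-bit scale per 16 weights = 4.5 bits/weight.
fp4_gb = weight_gb(PARAMS, 4.5)
# Unquantized 16-bit baseline for comparison.
bf16_gb = weight_gb(PARAMS, 16)

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"NVFP4 weights: {fp4_gb:.1f} GB")
print(f"Left for KV cache and activations: {GPU_MEMORY_GB - fp4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would exceed the card's memory, while the 4-bit weights leave roughly half of it for the KV cache that a long context window requires.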

Impact: Demonstrates efficient local deployment of a long-context model, potentially lowering the barrier to advanced LLM use on consumer hardware.

Ranking rationale: A community member shared performance metrics for a specific model release.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/Kindly-Cantaloupe978

    Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

    Qwen3.6-27B has been out for a few days, and the NVFP4 quantization with MTP was posted earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

    Can follow the…