PulseAugur

RTX 5090 struggles to exceed 250 TPS with Qwen3.5-4B model

A user on Reddit's r/LocalLLaMA forum reports that the Qwen3.5-4B model tops out at roughly 200 to 250 tokens per second on an RTX 5090, well below what they expected from a model this small on a high-end GPU. They tried several configurations, including different Docker images and LM Studio, but the bottleneck persists and GPU utilization stays low.

IMPACT A user reports low throughput from a small model on high-end hardware; the low GPU utilization suggests the bottleneck may lie in the software stack or configuration rather than in the GPU itself.

RANK_REASON User is reporting a performance issue with a specific model and hardware configuration.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/luckyj

    Can't get over 250TPS on RTX5090 with Qwen3.5-4B

    My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.

    I'm b…