Gemma-4 31B model achieves 1168 tokens/sec on single RTX 6000 PRO GPU

By PulseAugur Editorial · [1 source] · 2026-06-30 12:35

A technical blog post reports the performance of the Gemma-4 31B model served with vLLM on a single RTX 6000 PRO Blackwell GPU. At 24 concurrent requests the setup reached a peak throughput of approximately 1,168 tokens per second, and the author reports that the request queue never filled, which suggests remaining headroom. Median time-to-first-token stayed at around 0.7 seconds, but tail latency (p99) rose to approximately 19 seconds under heavy load, which the post highlights as the key metric to watch when scaling.
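To make the figures above concrete, here is a minimal Python sketch of the arithmetic behind them. It assumes the 1,168 tokens/sec figure is the aggregate across all 24 requests and that capacity is shared evenly, neither of which the summary states outright. The helper names and the latency samples are hypothetical and chosen only to show how a fast median and a slow p99 can coexist; they are not data from the post.

```python
import math
import statistics


def per_request_throughput(aggregate_tok_s: float, concurrency: int) -> float:
    """Average rate each request sees, assuming capacity is shared evenly."""
    return aggregate_tok_s / concurrency


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Figures reported in the summary: ~1,168 tokens/sec at 24 concurrent requests.
print(round(per_request_throughput(1168, 24), 1))  # 48.7

# Hypothetical latency samples in seconds (NOT from the source):
# most requests are fast, one is stuck behind the others.
ttft = [0.6, 0.7, 0.7, 0.8, 0.7, 0.6, 0.9, 0.7, 0.8, 19.0]
print(statistics.median(ttft))  # 0.7
print(percentile(ttft, 99))     # 19.0
```

A single slow request leaves the median untouched while dominating the p99, which is why the two numbers can diverge so sharply under load.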

IMPACT Shows what a single-GPU vLLM deployment of a 31B model can sustain under concurrent load, which is a closer proxy for production workloads than single-request benchmarks.

RANK_REASON Blog post detailing performance benchmarks of an LLM on specific hardware.

infra

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Nikhil · 2026-06-30 12:35

Gemma-4 31B + vLLM + RTX 6000 PRO : 1168 tokens/sec and still asking for more...

We pushed Gemma-4 31B to 24 concurrent requests on a single RTX 6000 PRO Blackwell. The queue never filled. ~1.17k tokens/sec, and it still had headroom. Most LLM "benchmarks" show you one request at a time. That tells you almost nothing about production. So we …
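The excerpt does not show the author's benchmark harness. As a rough illustration of what "24 concurrent requests" means in practice, the sketch below caps in-flight requests with an asyncio semaphore and reports aggregate tokens per second. Everything here is an assumption: `fake_generate` is a hypothetical stand-in that only sleeps, and a real test would replace it with a call to the serving endpoint; `run_load` and its parameters are likewise illustrative.

```python
import asyncio
import time


async def fake_generate(n_tokens: int, delay_s: float) -> int:
    """Hypothetical stand-in for one completion; a real test would call the server here."""
    await asyncio.sleep(delay_s)
    return n_tokens


async def run_load(total_requests: int, concurrency: int,
                   n_tokens: int = 64, delay_s: float = 0.01) -> dict:
    gate = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests
    in_flight = 0
    peak = 0

    async def one_request() -> int:
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_generate(n_tokens, delay_s)
            finally:
                in_flight -= 1

    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(total_requests)))
    elapsed = time.perf_counter() - start
    return {
        "total_tokens": sum(counts),
        "peak_in_flight": peak,
        "tokens_per_sec": sum(counts) / elapsed,
    }


result = asyncio.run(run_load(total_requests=48, concurrency=24))
print(result["total_tokens"], result["peak_in_flight"])
```

Sweeping `concurrency` upward and recording latency percentiles at each step is one way to see where throughput flattens and tail latency starts to climb.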

