PulseAugur

Gemma-4 31B model achieves 1168 tokens/sec on single RTX 6000 PRO GPU

A technical blog post benchmarks the Gemma-4 31B model served with vLLM on a single RTX 6000 PRO Blackwell GPU. At 24 concurrent requests the setup reached a peak throughput of approximately 1,168 tokens per second, and the request queue never filled, which suggests remaining headroom. Median time-to-first-token stayed fast at around 0.7 seconds, but p99 tail latency rose to approximately 19 seconds under heavy load, making tail latency the key metric to watch when scaling.
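To make the gap between a 0.7-second median and a 19-second p99 concrete, here is a minimal sketch of how such figures are commonly derived from raw measurements. It is not the blog author's benchmark code: the helper names `percentile` and `aggregate_throughput`, the nearest-rank percentile method, and all sample values are assumptions chosen for illustration only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate_throughput(total_output_tokens, wall_seconds):
    """Tokens generated across all concurrent requests per second of wall time."""
    return total_output_tokens / wall_seconds


# Hypothetical time-to-first-token samples in seconds (NOT data from the post):
# most requests start quickly, while a few wait a long time under load.
ttft_samples = [0.6] * 40 + [0.7] * 55 + [5.0] * 3 + [19.0] * 2

p50 = percentile(ttft_samples, 50)
p99 = percentile(ttft_samples, 99)

# Hypothetical totals: 70,080 output tokens over a 60-second window.
tokens_per_second = aggregate_throughput(70_080, 60.0)

print(f"median TTFT: {p50} s, p99 TTFT: {p99} s, throughput: {tokens_per_second} tok/s")
```

With these made-up samples only 2 requests in 100 are slow, so the median barely moves while the p99 lands on the slowest group. That is why a median alone can hide the waits that some users see under load.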

IMPACT Demonstrates the inference capabilities of a specific LLM and hardware configuration for production workloads.

RANK_REASON Blog post detailing performance benchmarks of an LLM on specific hardware.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to — LLM tag TIER_1 English(EN) · Nikhil

    Gemma-4 31B + vLLM + RTX 6000 PRO : 1168 tokens/sec and still asking for more...

    We pushed Gemma-4 31B to 24 concurrent requests on a single RTX 6000 PRO Blackwell. The queue never filled. ~1.17k tokens/sec, and it still had headroom.

    Most LLM "benchmarks" show you one request at a time. That tells you almost nothing about production.

    So we …