PulseAugur

GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards

A user on the r/LocalLLaMA subreddit has optimized the GLM 5.1 model for local deployment, achieving notable performance. By patching the sglang inference server and running on four RTX 6000 Pro GPUs (power-limited to 350 W), they reported a decode throughput of 40 tokens per second and over 2,000 tokens per second for prompt prefill. The user noted that current inference software is not yet fully optimized for these cards, suggesting further performance gains are possible.
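The setup described above (power-limited cards plus multi-GPU tensor-parallel serving) can be sketched roughly as follows. This is a minimal sketch, not the poster's actual commands: the model path is a placeholder, and flag names may differ across sglang versions.

```shell
# Cap each GPU's power draw at 350 W, as the poster did (requires root).
sudo nvidia-smi -pl 350

# Launch the sglang server with 4-way tensor parallelism across the four cards.
# The checkpoint path below is a placeholder, not the repo used in the post.
python3 -m sglang.launch_server \
  --model-path <path-to-glm-5.1-nvfp4-checkpoint> \
  --tp 4 \
  --port 30000
```

The 350 W limit trades some peak throughput for lower heat and power draw, which matters when packing four high-end cards into one workstation.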

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Demonstrates potential for high-throughput local LLM inference with optimized hardware and software configurations.

RANK_REASON User-reported performance optimization of an open-source model on specific hardware.

Read on r/LocalLLaMA →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 · /u/val_in_tech

    GLM 5.1 Locally: 40tps, 2000+ pp/s

    After some sglang patching and countless experiments, managed to get reap-ed nvfp4 version running stable and FAST on 4 x RTX 6000 Pros (limited to 350W). Very happy with performance and quality. Inference software is still under-optimized for th…