PulseAugur

GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards

A user on the r/LocalLLaMA subreddit reports running GLM 5.1 locally on four RTX 6000 Pro GPUs, each power-limited to 350 W. After patching the sglang inference software and running many experiments, they got a "reap-ed" NVFP4 version of the model running stably at about 40 tokens per second for generation and more than 2,000 tokens per second for prompt processing (prefill). The user is happy with both performance and output quality, and notes that inference software is still under-optimized for these cards, which suggests further gains are possible.
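To put the two reported rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 2,000 tokens/sec figure applies to prompt processing and the 40 tokens/sec figure to generation, treats both as constant, and uses a hypothetical request of 8,000 prompt tokens and 500 output tokens. None of these request sizes come from the original post, and real rates will vary with context length and batch size.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tps=2000.0, decode_tps=40.0):
    """Estimate single-request latency from flat throughput figures.

    prefill_tps and decode_tps default to the approximate rates reported
    in the post; treating them as constants is a simplifying assumption.
    """
    prefill_s = prompt_tokens / prefill_tps   # time to ingest the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s, decode_s, prefill_s + decode_s


# Hypothetical request: 8,000-token prompt, 500-token answer.
prefill_s, decode_s, total_s = estimate_latency_s(8000, 500)
print(f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s = {total_s:.1f}s")
```

Under these assumptions, generation time dominates: the prompt is ingested in a few seconds, while the reply takes roughly three times as long to produce.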

IMPACT Demonstrates potential for high-throughput local LLM inference with optimized hardware and software configurations.

RANK_REASON User-reported performance optimization of an open-source model on specific hardware.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/val_in_tech

    GLM 5.1 Locally: 40tps, 2000+ pp/s

    After some sglang patching and countless experiments, managed to get reap-ed nvfp4 version running stable and FAST on 4 x RTX 6000 Pros (limited to 350W). Very happy with performance and quality. Inference software is still under-optimized for th…