PulseAugur

GLM-5.2-FP8 deployed with 262k context on HGX-H200

A user shared their Docker deployment configuration for GLM-5.2-FP8 on an HGX-H200 system using SGLang. The configuration achieved a 262k-token context window and a throughput of 70 tokens per second. The user noted that certain flags, such as DP and moe-a2a-backend, were disabled to optimize performance, and that the official vLLM recipes did not work on H200 because of FP8 quantization on the DSV3 architecture.
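To put the two reported figures in perspective, here is a minimal arithmetic sketch. It assumes that "262k" means 256 × 1024 = 262,144 tokens and that 70 tokens per second is a steady single-stream decode rate; the summary confirms neither, and the function and constant names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the summary.
# Assumptions (not confirmed by the source):
#   - "262k" context is 256 * 1024 tokens
#   - 70 tokens/s is a steady single-stream decode rate

ASSUMED_CONTEXT_TOKENS = 256 * 1024   # 262144 tokens (assumed reading of "262k")
REPORTED_TOKENS_PER_SECOND = 70.0     # throughput figure from the summary


def generation_seconds(num_tokens: int,
                       tokens_per_second: float = REPORTED_TOKENS_PER_SECOND) -> float:
    """Return the time needed to emit num_tokens at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    print(f"assumed context window: {ASSUMED_CONTEXT_TOKENS} tokens")
    for n in (1_000, 8_000, 32_000):
        print(f"{n:>6} output tokens take about {generation_seconds(n):.1f} s")
```

Under these assumptions, a 7,000-token response would take about 100 seconds to generate. Prefill time for a long prompt is not covered by this estimate.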

IMPACT Provides insights into optimizing large context windows and throughput for specific hardware configurations.

RANK_REASON User-shared deployment configuration for a specific model and hardware setup.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Soft-Wedding4595

    My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

    Hello lads. The title says it all. Right now, after 1-2 hours of experimenting, this is the maximum I could squeeze out of the current hardware.

    No, I'm not rich. These are my company's GPUs; I'm just sharing my experience.

    docker run -d \
      --name glm-5.2…