GLM-5.2 UD-IQ1_M speed test on llama.cpp shows 579 t/s prefill

By PulseAugur Editorial · [1 source] · 2026-06-22 14:17

A Reddit user shared performance benchmarks for the GLM-5.2 UD-IQ1_M model running on llama.cpp. The tests used an RTX 5090 and an RTX 3090 Ti and reported approximately 579 tokens/second for prefill at an 8k context and 324 tokens/second at a 57k context. Decode (token generation) speed was measured at around 10.6 tokens/second.
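To put these throughput figures in wall-clock terms, the following is a minimal sketch that converts them into rough latency estimates. It assumes that "8k" and "57k" mean 8,000 and 57,000 prompt tokens, that each reported prefill rate is an average over the whole prompt, and that the decode rate stays at 10.6 tokens/second regardless of context depth. The post excerpt confirms none of these, and the function name and the 500-token reply length are purely illustrative.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_s, decode_s, total_s) for one request.

    Assumes constant throughput in each phase, which is a simplification:
    real prefill and decode rates vary with context depth and batch size.
    """
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s, decode_s, prefill_s + decode_s


# Throughput figures as reported in the post; token counts are assumptions.
REPORTED_DECODE_TPS = 10.6
scenarios = [
    ("8k prompt", 8_000, 579.0),
    ("57k prompt", 57_000, 324.0),
]

for label, prompt_tokens, prefill_tps in scenarios:
    prefill_s, decode_s, total_s = estimate_latency_s(
        prompt_tokens, 500, prefill_tps, REPORTED_DECODE_TPS
    )
    print(f"{label}: prefill {prefill_s:.1f}s, "
          f"decode {decode_s:.1f}s, total {total_s:.1f}s")
```

Under these assumptions, the 8k prompt takes roughly 14 seconds to process and the 57k prompt close to three minutes, while a 500-token reply adds about 47 seconds in either case. For short replies on long prompts, prefill dominates the wait. For long replies, the decode rate becomes the bottleneck.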

IMPACT Provides specific performance data for running large language models locally, aiding developers in hardware and software choices.

RANK_REASON User-generated performance benchmarks for a specific model and software combination.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Shoddy_Bed3240 · 2026-06-22 14:17

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Just sharing some speed test numbers for GLM-5.2 running on llama.cpp. Setup:

- Model: unsloth/GLM-5.2-GGUF, UD-IQ1_M quant
- GPUs: RTX 5090 + RTX 3090 Ti
- 186 GB DDR5 used
- Debian 13
…
