PulseAugur

GLM-5.2 UD-IQ1_M speed test on llama.cpp shows 579 t/s prefill

A Reddit user shared llama.cpp benchmarks for the GLM-5.2 model in the UD-IQ1_M quantization. The tests ran on an RTX 5090 paired with an RTX 3090 Ti and reported approximately 579 tokens/second for prefill at an 8k context and 324 tokens/second at a 57k context. Decoding (token generation) ran at around 10.6 tokens/second.
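To put the reported rates in terms of wait time, the sketch below converts them into rough latency estimates. It is a back-of-the-envelope calculation, not part of the original benchmark: the function names are hypothetical, "8k" and "57k" are assumed to mean 8,000 and 57,000 prompt tokens, the 500-token reply length is an arbitrary example, and each rate is treated as a steady average.

```python
# Approximate rates reported in the summary (tokens per second).
PREFILL_TPS_8K = 579.0    # prefill at an 8k context
PREFILL_TPS_57K = 324.0   # prefill at a 57k context
DECODE_TPS = 10.6         # token generation


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def estimate_latency(prompt_tokens: int, prefill_tps: float,
                     output_tokens: int, decode_tps: float = DECODE_TPS) -> dict:
    """Split a request into prefill time and decode time, in seconds."""
    prefill = seconds_for(prompt_tokens, prefill_tps)
    decode = seconds_for(output_tokens, decode_tps)
    return {"prefill_s": prefill, "decode_s": decode, "total_s": prefill + decode}


if __name__ == "__main__":
    # Assumed prompt sizes: 8,000 and 57,000 tokens; assumed reply: 500 tokens.
    cases = [("8k", 8_000, PREFILL_TPS_8K), ("57k", 57_000, PREFILL_TPS_57K)]
    for label, prompt_tokens, prefill_tps in cases:
        est = estimate_latency(prompt_tokens, prefill_tps, output_tokens=500)
        print(f"{label}: prefill {est['prefill_s']:.1f} s, "
              f"decode {est['decode_s']:.1f} s, total {est['total_s']:.1f} s")
```

Under these assumptions, the 8k prompt takes roughly 14 seconds to prefill and the 57k prompt roughly 176 seconds, while a 500-token reply adds about 47 seconds of decoding in either case. Actual timings will vary with prompt length, cache state, and how the rates were measured.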

IMPACT Provides specific performance data for running large language models locally, aiding developers in hardware and software choices.

RANK_REASON User-generated performance benchmarks for a specific model and software combination.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/Shoddy_Bed3240

    GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

    Just sharing some speed test numbers for GLM-5.2 running on llama.cpp.

    Setup:

    - Model: unsloth/GLM-5.2-GGUF, UD-IQ1_M quant
    - GPUs: RTX 5090 + RTX 3090 Ti
    - 186 GB DDR5 used
    - Debian 13
    …