PulseAugur

GLM-5.2 model runs at 7.3 tok/s locally with 4x RTX 3090s

A user has detailed their experience running the GLM-5.2 UD-IQ2_M model locally, achieving approximately 7.3 tokens per second across four RTX 3090 GPUs and 192GB of RAM. They found that dropping to a smaller quantization (from IQ2_M to IQ1_M) had no impact on speed, while increasing CPU threads from 6 to 12 resulted in a 22% performance boost. The user concluded that decode speed is limited primarily by CPU compute for the offloaded experts rather than by memory bandwidth, and that disabling the model's "thinking" (reasoning) mode significantly shortens response times.
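The reasoning behind that conclusion can be illustrated with a toy model in which the time per token is whichever is larger: the time to read the weights from memory or the time for the CPU to compute on them. This is only a sketch. The function name and every number in it are hypothetical and are not measurements from the post. It assumes perfect thread scaling, so it shows a doubling where the user reported 22%; only the qualitative pattern carries over.

```python
def decode_tokens_per_second(weight_bytes_per_token, mem_bandwidth_bytes_s,
                             cpu_ops_per_token, ops_per_thread_s, threads):
    """Toy model: per-token time is the larger of memory time and compute time."""
    memory_time = weight_bytes_per_token / mem_bandwidth_bytes_s
    compute_time = cpu_ops_per_token / (ops_per_thread_s * threads)
    return 1.0 / max(memory_time, compute_time)

# Hypothetical values chosen only to put the model in a compute-bound regime.
BANDWIDTH = 80e9        # bytes/s of system memory bandwidth (made up)
OPS_PER_THREAD = 5e9    # operations/s per CPU thread (made up)
OPS_PER_TOKEN = 6e9     # CPU operations per decoded token (made up)

# Halving the bytes read per token stands in for a smaller quantization.
larger_quant = decode_tokens_per_second(4e9, BANDWIDTH, OPS_PER_TOKEN, OPS_PER_THREAD, threads=6)
smaller_quant = decode_tokens_per_second(2e9, BANDWIDTH, OPS_PER_TOKEN, OPS_PER_THREAD, threads=6)
more_threads = decode_tokens_per_second(4e9, BANDWIDTH, OPS_PER_TOKEN, OPS_PER_THREAD, threads=12)

print(f"larger quant, 6 threads:  {larger_quant:.1f} tok/s")
print(f"smaller quant, 6 threads: {smaller_quant:.1f} tok/s")
print(f"larger quant, 12 threads: {more_threads:.1f} tok/s")
```

When compute time dominates, shrinking the weights leaves throughput unchanged while adding threads raises it, which matches the pattern the user describes. A memory-bound setup would show the opposite.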

IMPACT Provides insights into optimizing local LLM inference performance and hardware utilization.

RANK_REASON User-generated guide on running a specific LLM locally with custom hardware configuration.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Important_Quote_1180 ·

    GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

    TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be OK. I got my 4th 3090 and then Unsloth dropped the Q2 and Q1. I wrote nothing else here; it's from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs a…