GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster
A user detailed their experience running GLM-5.2 locally at the UD-IQ2_M quantization, reaching about 7.3 tokens per second on four RTX 3090 GPUs and 192GB of RAM. Dropping to a quantization of roughly half the bit-width (IQ2_M to IQ1_M) did not change decode speed, while raising the CPU thread count from 6 to 12 improved it by 22%. If decoding were limited by memory bandwidth, the smaller quantization should have been faster, since fewer bytes are read per token. Because it was not, and adding threads helped, the user concluded that decode speed is limited mainly by CPU compute for the offloaded experts rather than by memory bandwidth. They also found that disabling the model's "thinking" (reasoning) mode significantly shortens response times.
IMPACT Suggests that when a large model's experts are offloaded to the CPU, adding CPU threads can improve decode speed more than switching to a smaller quantization.
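The reasoning in the summary can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the bit-widths are the nominal 2-bit and 1-bit figures implied by "halving" (the actual average bits per weight of these quantizations differ), and the purely bandwidth-bound model is a simplifying assumption, not the user's own method.

```python
def bandwidth_bound_speedup(bits_before: float, bits_after: float) -> float:
    """Speedup expected if decode speed depended only on bytes read per token.

    Assumption: throughput is inversely proportional to bits per weight.
    """
    return bits_before / bits_after


def thread_scaling_efficiency(threads_before: int, threads_after: int,
                              observed_speedup: float) -> float:
    """Fraction of the ideal linear speedup actually achieved by adding threads."""
    ideal_speedup = threads_after / threads_before
    return observed_speedup / ideal_speedup


# Nominal bit-widths (assumed); the report saw no speed change for this step.
predicted = bandwidth_bound_speedup(2.0, 1.0)

# Reported: going from 6 to 12 threads gave a 22% boost, i.e. a 1.22x speedup.
efficiency = thread_scaling_efficiency(6, 12, 1.22)

print(f"Bandwidth-bound prediction for IQ2 -> IQ1: {predicted:.2f}x (reported: no change)")
print(f"Thread scaling efficiency for 6 -> 12 threads: {efficiency:.0%}")
```

A bandwidth-bound model predicts a large gain from the smaller quantization, which the report did not observe, while the thread count did move throughput. The thread scaling is well short of linear, so CPU compute is probably not the only limit.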