PulseAugur / Brief
EN
LIVE 07:55:46

Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

    A user has detailed their experience running the GLM-5.2 UD-IQ2_M model locally, achieving approximately 7.3 tokens per second across four RTX 3090 GPUs and 192GB of RAM. They found that halving the quantization level (from IQ2 to IQ1) had no impact on speed, while increasing CPU threads from 6 to 12 resulted in a 22% performance boost. The user concluded that decode speed is primarily limited by CPU compute for offloaded experts rather than memory bandwidth, and that disabling the model's "thinking" or reasoning capabilities significantly speeds up response times. AI

    GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

    IMPACT Provides insights into optimizing local LLM inference performance and hardware utilization.