GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

GLM-5.2 runs locally at 7.3 tok/s on 4x RTX 3090

By PulseAugur Editorial · [1 source] · 2026-06-19 00:06

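A user described running the GLM-5.2 UD-IQ2_M model locally, reaching about 7.3 tokens/s on four RTX 3090 GPUs and 192GB of RAM. They found that halving the quantization level from IQ2 to IQ1 had no effect on speed, whereas raising the CPU thread count from 6 to 12 improved performance by 22%. The user concluded that decode speed is limited mainly by CPU compute for the offloaded experts rather than by memory bandwidth, and that disabling the model's "thinking" (reasoning) mode noticeably shortens response time.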
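One way to read these figures is with a short back-of-the-envelope calculation. The sketch below is illustrative, not the poster's method. It takes only the reported 22% gain, the 6 and 12 thread counts, and the "no speedup from IQ1" observation from the summary. The use of Amdahl's law, the nominal 2-bit and 1-bit widths, and all function names are assumptions made for the example.

```python
# Rough reading of the reported numbers. Only the +22% gain, the 6 -> 12
# thread change, and "IQ1 was no faster" come from the summary; the model
# (Amdahl's law) and the nominal bit widths are illustrative assumptions.

def amdahl_parallel_fraction(speedup: float, thread_ratio: float) -> float:
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / thread_ratio)


def bandwidth_bound_speedup(bits_before: float, bits_after: float) -> float:
    """Idealized speedup if decode time scaled with bytes read per token."""
    return bits_before / bits_after


thread_speedup = 1.22      # reported: +22% going from 6 to 12 threads
thread_ratio = 12 / 6      # reported thread counts

p = amdahl_parallel_fraction(thread_speedup, thread_ratio)
efficiency = thread_speedup / thread_ratio

# Hypothetical nominal widths: real IQ2_M / IQ1_M files mix precisions,
# so the true size ratio is likely smaller than 2.
expected_if_bw_bound = bandwidth_bound_speedup(2.0, 1.0)
observed_quant_speedup = 1.0   # reported: IQ1 gave no speedup

print(f"implied thread-scalable fraction: {p:.2f}")
print(f"scaling efficiency when doubling threads: {efficiency:.2f}")
print(f"bandwidth-bound prediction: {expected_if_bw_bound:.1f}x, "
      f"observed: {observed_quant_speedup:.1f}x")
```

Under these assumptions, a purely bandwidth-bound decode would be expected to speed up when the weights shrink, while the reported result shows no change. The thread-count gain is consistent with part of the per-token work scaling with CPU cores. Neither figure is a measurement beyond what the post reports.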

Impact: Offers insight into optimizing local LLM inference performance and hardware utilization.

Ranking rationale: A user-written guide to running a specific LLM locally on a custom hardware configuration.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Important_Quote_1180 · 2026-06-19 00:06

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be ok. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. I wrote nothing else here its from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs a…

