GLM-5.2 speculative decode runs on 4x DGX GB10 cluster

By the PulseAugur editorial team · [1 source] · 2026-06-24 21:23

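A user got GLM-5.2 running with MTP speculative decoding on a 4x DGX Spark (GB10) cluster, reaching a throughput of roughly 9.4 tokens/s. Getting there meant reconstructing the missing image-build modifications from the public kernels and building vLLM at a specific pinned ref to avoid a crash when loading the weights. The user also walks through the steps for tuning the setup, including a data-free pruning method that fits the model into memory, and gives notes on network configuration for multi-node performance.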
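For readers unfamiliar with the technique, the draft-and-verify loop behind speculative decoding can be sketched in a few lines. This is a toy illustration with greedy acceptance, not the Reddit author's setup and not vLLM's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names invented here. In MTP-style setups the draft tokens generally come from extra prediction heads of the model itself rather than from a separate draft model, and a real engine verifies all drafted positions in one batched forward pass instead of the sequential loop shown.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_step(target: NextToken, draft: NextToken,
                     ctx: List[Token], k: int) -> List[Token]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))

    accepted: List[Token] = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_new: int, k: int = 3) -> List[Token]:
    """Repeat speculative steps until n_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
    print(greedy_decode(toy_target, [0], 12))
```

With greedy acceptance the speculative output is identical to plain greedy decoding with the target. Any speedup comes from accepting several tokens per verification step whenever the drafter guesses correctly.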

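Impact: Demonstrates advanced techniques for deploying large models on specialized hardware, which may improve inference speed for users with similar setups.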

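Ranking rationale: User-level integration and optimization of existing models and frameworks, not a frontier release or a major industry event.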

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/anvarazizov · 2026-06-24 21:23

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing

TL;DR: the recipe's image-build mods aren't actually public – I reconstructed them from the public kernels (with Claude) – and you have to build vLLM at the author's exact pinned ref or the real AWQ weights crash on load. Running now at ~9.4 tok/…
