Perplexity research shows NVIDIA GB200 excels at LLM inference

By the PulseAugur editorial team · [4 sources] · 2026-05-12 14:17

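Perplexity has published research detailing how it serves large language models, specifically post-trained Qwen3 235B, on NVIDIA's GB200 NVL72 Blackwell racks. The results indicate that, compared with previous-generation NVIDIA Hopper hardware (H200), the GB200 platform delivers lower latency and higher throughput for large-model inference. The research presents GB200 as a platform for high-throughput inference as well as training, particularly for Mixture-of-Experts (MoE) models.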

Impact: NVIDIA's GB200 Blackwell platform shows significant gains in LLM inference speed and cost efficiency, which could accelerate the deployment of large models.

Ranking rationale: This cluster contains research published by Perplexity on LLM inference hardware.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

X — Perplexity TIER_1 English(EN) · perplexity_ai · 2026-05-12 14:17

NVIDIA remains the strongest platform for large-model inference at scale. Prefill/decode disaggregation, Blackwell-native quantization, custom kernels, and rack-scale NVLink turn GB200 into faster answers and lower serving cost. Read the full paper here

X — Perplexity TIER_1 English(EN) · perplexity_ai · 2026-05-12 14:17

The benchmarks show the gap. NVLS all-reduce latency drops from 586.1µs on H200 to 313.3µs on GB200. In MoE prefill at EP=4, combine falls from 730.1µs to 438.5µs. For decode, GB200 sustains much higher throughput at high token speeds.
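
To put the quoted latencies in relative terms, the short sketch below computes the percentage reduction and the speedup factor for each pair. It uses only the four figures quoted in the post; the dictionary labels and the `compare` helper are illustrative names, not anything taken from the paper.

```python
# Latencies in microseconds, exactly as quoted in the post: (H200, GB200).
REPORTED_US = {
    "NVLS all-reduce": (586.1, 313.3),
    "MoE prefill combine, EP=4": (730.1, 438.5),
}


def compare(h200_us: float, gb200_us: float) -> tuple[float, float]:
    """Return (percent latency reduction, speedup factor) of GB200 over H200."""
    reduction_pct = (h200_us - gb200_us) / h200_us * 100.0
    speedup = h200_us / gb200_us
    return reduction_pct, speedup


for name, (h200_us, gb200_us) in REPORTED_US.items():
    pct, speedup = compare(h200_us, gb200_us)
    print(f"{name}: {pct:.1f}% lower latency, {speedup:.2f}x faster")
```

On these numbers the all-reduce latency is roughly 47% lower and the combine step roughly 40% lower. These are per-operation figures and do not by themselves say how much end-to-end serving latency changes.
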
X — Perplexity TIER_1 English(EN) · perplexity_ai · 2026-05-12 14:17

Prefill and decode stress hardware differently. Prefill is compute-bound, so Blackwell Tensor Cores, memory bandwidth, NVLink, and SHARP reductions help. Decode is latency/memory-bound, where GB200’s rack-scale NVLink domain opens up parallelism Hopper could not.
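
The compute-bound versus memory-bound distinction can be illustrated with a simple roofline-style estimate for a single weight matrix multiply. This is a rough sketch, not the paper's analysis: the layer width, token counts, and the accelerator's peak throughput and bandwidth below are hypothetical round numbers chosen for illustration, and the model ignores attention, the KV cache, MoE routing, and inter-GPU communication.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul.

    Counts one read of the weights and inputs and one write of the outputs.
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


def regime(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a kernel against the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator, NOT a GB200 or H200 specification.
PEAK_FLOPS = 1.0e15      # FLOP/s (assumed)
MEM_BANDWIDTH = 4.0e12   # bytes/s (assumed), so the ridge is 250 FLOP/byte
D = 4096                 # assumed layer width

prefill = arithmetic_intensity(tokens=4096, d_in=D, d_out=D)  # long prompt
decode = arithmetic_intensity(tokens=8, d_in=D, d_out=D)      # small decode batch

print(f"prefill: {prefill:.1f} FLOP/byte ->", regime(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print(f"decode:  {decode:.1f} FLOP/byte ->", regime(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, prefill reuses each weight across thousands of tokens and lands well above the ridge point, while decode touches the same weights for only a handful of tokens per step and lands far below it.
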
X — Perplexity TIER_1 English(EN) · perplexity_ai · 2026-05-12 14:17

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform. https://t.co/yYZuPRXWzr
