DeepSeek V4 Introduces MegaMoE Optimization for Efficient MoE

By the PulseAugur editorial team · [4 sources] · 2026-05-15 23:00

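DeepSeek has released its V4 model, which is heavily optimized by a new system called MegaMoE. The system uses a 1,400-line fused CUDA kernel that improves performance through fine-grained pipelining of communication and computation within a single model layer. This approach addresses a common challenge of Mixture-of-Experts (MoE) models: the large amount of all-to-all communication they typically require.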

Impact: Introduces a novel optimization for Mixture-of-Experts architectures that may improve training efficiency and inference speed for large models.

Ranking rationale: Frontier-lab model release, accompanied by a system card.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-05-15 23:00

After Linear 2, BF16 results go directly to remote combine buffers over NVLink. Epilogue warps do per-token top-k reduction with double-buffered TMA, accumulate, and TMA-store the final output. All of this happens in one kernel, with fine grained pipelining over experts. https:/…
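
The per-token top-k reduction mentioned in this post can be sketched in plain Python. This is only an illustration under assumptions, not DeepSeek's kernel: the buffer layout `combine_buffer[token][slot]`, the function name `topk_reduce`, and the two-slot `staging` list are hypothetical, and the contributions are assumed to already carry their routing weights. The ping-pong staging only mimics the indexing pattern of double buffering; in Python the "load" and the "reduce" still run one after the other.

```python
def topk_reduce(combine_buffer):
    """Sum the k expert contributions for each token.

    combine_buffer[t][j] is the hidden vector written by the j-th selected
    expert for token t (hypothetical layout, assumed already weighted).
    """
    num_tokens = len(combine_buffer)
    outputs = []
    if num_tokens == 0:
        return outputs
    staging = [None, None]           # two staging slots used in ping-pong fashion
    staging[0] = combine_buffer[0]   # prefetch the first token's slots
    for t in range(num_tokens):
        if t + 1 < num_tokens:
            # Issue the "load" for the next token into the other slot.
            staging[(t + 1) % 2] = combine_buffer[t + 1]
        slots = staging[t % 2]
        acc = [0.0] * len(slots[0])
        for contribution in slots:   # reduce over the k selected experts
            for d, value in enumerate(contribution):
                acc[d] += value
        outputs.append(acc)          # stands in for the final store
    return outputs


# Three tokens, top-2 routing, hidden size 2 (toy values).
buf = [
    [[1.0, 2.0], [0.5, 0.5]],
    [[0.0, 1.0], [3.0, -1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
reduced = topk_reduce(buf)
print(reduced)
```

On a GPU the point of the second staging slot is that the next tile's load can be in flight while the current tile is being reduced; the sketch keeps only that alternation of slots.
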
X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-05-15 23:00

DeepSeek takes it one step further by breaking up the workload and adding fine-grained pipelining of communication and computation within a single layer. MegaMoE breaks the experts into waves, and instead of only overlapping TMA memory loads+tensor core math+epilogue SIMT work,
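
A toy timing model can show why splitting the experts into waves helps. It is a sketch under stated assumptions rather than a measurement: the three stages (dispatch, expert compute, combine), their per-wave costs, and the function names `serialized_time` and `pipelined_time` are all hypothetical, and each stage is assumed to process one wave at a time.

```python
def serialized_time(num_waves, stage_costs):
    """No overlap: every wave pays the full cost of every stage in sequence."""
    return num_waves * sum(stage_costs)


def pipelined_time(num_waves, stage_costs):
    """Pipeline makespan: a wave enters stage s once it has left stage s-1
    and the previous wave has left stage s."""
    finish = [0.0] * len(stage_costs)  # finish time of the latest wave per stage
    for _ in range(num_waves):
        prev_stage_done = 0.0
        for s, cost in enumerate(stage_costs):
            start = max(prev_stage_done, finish[s])
            finish[s] = start + cost
            prev_stage_done = finish[s]
    return finish[-1]


# Hypothetical per-wave costs in arbitrary time units: dispatch, compute, combine.
costs = [2.0, 3.0, 2.0]
print(serialized_time(4, costs))  # 28.0
print(pipelined_time(4, costs))   # 16.0
```

With these made-up numbers the pipelined schedule is bounded by the slowest stage (compute) plus a fill and drain cost, which is the usual motivation for overlapping communication with computation.
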
X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-05-15 23:00

The problem is that MoE with Expert Parallelism requires all-to-all communications before and after the layer to dispatch tokens to their respective experts then combine results back afterwards. Naively, these communications are separate kernel launches that don't get overlapped …
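
The dispatch and combine steps described in this post can be illustrated with a small single-process simulation. It is a minimal sketch under assumptions, not the actual implementation: the route format `(token_id, expert_id, weight)`, the contiguous expert-to-rank mapping, the toy expert function, and the names `dispatch`, `run_experts`, and `combine` are all hypothetical, and the per-rank lists merely stand in for the two all-to-all exchanges.

```python
def dispatch(routes, num_ranks, experts_per_rank):
    """Group routes by the rank that owns each expert (first all-to-all)."""
    inboxes = [[] for _ in range(num_ranks)]
    for token_id, expert_id, weight in routes:
        owner = expert_id // experts_per_rank  # assumed contiguous expert sharding
        inboxes[owner].append((token_id, expert_id, weight))
    return inboxes


def run_experts(inbox, tokens, expert_fn):
    """Apply the expert to every token a rank received, scaled by its routing weight."""
    return [(t, w * expert_fn(e, tokens[t])) for t, e, w in inbox]


def combine(per_rank_outputs, num_tokens):
    """Return results to their tokens and sum the contributions (second all-to-all)."""
    out = [0.0] * num_tokens
    for outputs in per_rank_outputs:
        for token_id, value in outputs:
            out[token_id] += value
    return out


def toy_expert(expert_id, x):
    # Hypothetical stand-in for an expert MLP: scale the input by (expert_id + 1).
    return (expert_id + 1) * x


tokens = [1.0, 2.0, 3.0]
# Each token is routed to two experts; 4 experts are spread over 2 ranks.
routes = [(0, 0, 0.5), (0, 3, 0.5), (1, 1, 0.25), (1, 2, 0.75), (2, 0, 1.0), (2, 1, 0.0)]
inboxes = dispatch(routes, num_ranks=2, experts_per_rank=2)
outputs = [run_experts(inbox, tokens, toy_expert) for inbox in inboxes]
result = combine(outputs, num_tokens=len(tokens))
print(result)  # [2.5, 5.5, 3.0]
```

In the naive setup the post describes, each of these three phases would be a separate kernel launch, with the expert compute idle while tokens are in transit.
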
X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-05-15 23:00

As we've come to expect from a DeepSeek release, DeepSeek V4 comes with more flashy ML systems optimizations. This time? MegaMoE, a 1400 line fused CUDA kernel that computes the entire MoE forward pass. Let's see how it works (1/4) 🧵 https://t.co/rqv6y2i3JV
