PulseAugur

DeepSeek V4 Introduces MegaMoE Optimization for Efficient MoE

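DeepSeek has released its V4 model, which is substantially optimized by a new system called MegaMoE. The system uses a 1,400-line fused CUDA kernel that improves performance through fine-grained pipelining of communication and computation within a model layer. This approach addresses a common challenge of Mixture-of-Experts (MoE) models: the heavy all-to-all communication they typically require.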
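
To make the all-to-all pattern concrete, here is a minimal single-process sketch of an MoE forward pass: tokens are dispatched to their top-k experts, each expert runs, and the results are combined per token. It is only an illustration, not DeepSeek's kernel. The function `moe_forward`, the toy experts, and the gate weights are all hypothetical, and weighting the expert outputs during the combine is an assumption made for the example.

```python
from collections import defaultdict

def moe_forward(tokens, routes, experts):
    """Toy MoE layer. routes[t] lists (expert_id, gate_weight) pairs for token t."""
    # Dispatch: group token indices by expert. Under expert parallelism this
    # grouping is what the first all-to-all communication performs.
    inbox = defaultdict(list)
    for t, route in enumerate(routes):
        for expert_id, weight in route:
            inbox[expert_id].append((t, weight))

    # Expert compute, then combine: each expert's output is accumulated back
    # into its token's slot (the second all-to-all plus a per-token reduction).
    out = [0.0] * len(tokens)
    for expert_id, items in inbox.items():
        expert_fn = experts[expert_id]
        for t, weight in items:
            out[t] += weight * expert_fn(tokens[t])
    return out

# Hypothetical toy data: 3 tokens, 3 experts, up to top-2 routing.
tokens = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.25), (2, 0.75)]]
experts = {0: lambda x: x + 1, 1: lambda x: 2 * x, 2: lambda x: x * x}
result = moe_forward(tokens, routes, experts)
print(result)  # [2.0, 4.0, 7.75]
```

In a multi-GPU setting the `inbox` grouping and the final accumulation each cross device boundaries, which is why they are usually the communication bottleneck.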

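Impact: Introduces novel optimizations for the Mixture-of-Experts architecture that may improve the training efficiency and inference speed of large models.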

Ranking rationale: Frontier-lab model release with an accompanying system card.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ ·

    After Linear 2, BF16 results go directly to remote combine buffers over NVLink. Epilogue warps do per-token top-k reduction with double-buffered TMA, accumulate, and TMA-store the final output. All of this happens in one kernel, with fine grained pipelining over experts. https:/…

  2. X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ ·

    DeepSeek takes it one step further by breaking up the workload and adding fine-grained pipelining of communication and computation within a single layer. MegaMoE breaks the experts into waves, and instead of only overlapping TMA memory loads+tensor core math+epilogue SIMT work,

  3. X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ ·

    The problem is that MoE with Expert Parallelism requires all-to-all communications before and after the layer to dispatch tokens to their respective experts then combine results back afterwards. Naively, these communications are separate kernel launches that don't get overlapped …

  4. X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ ·

    As we've come to expect from a DeepSeek release, DeepSeek V4 comes with more flashy ML systems optimizations. This time? MegaMoE, a 1400 line fused CUDA kernel that computes the entire MoE forward pass. Let's see how it works (1/4) 🧵 https://t.co/rqv6y2i3JV
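
The benefit of splitting experts into waves, as the excerpts describe, can be illustrated with a toy timing model. This is a rough sketch under stated assumptions, not a measurement of MegaMoE: it assumes each wave has a dispatch, compute, and combine stage, that the three stages run on independent resources, and that the per-wave durations (arbitrary units) are made up for illustration. The function names are hypothetical.

```python
def serial_time(waves):
    """Total time if every stage of every wave runs back to back, with no overlap."""
    return sum(dispatch + compute + combine for dispatch, compute, combine in waves)

def pipelined_time(waves):
    """Total time if wave i+1 may dispatch while wave i computes, and so on.

    Each stage is modeled as its own resource, so a stage starts once the
    same stage of the previous wave and the previous stage of this wave
    have both finished.
    """
    dispatch_end = compute_end = combine_end = 0
    for dispatch, compute, combine in waves:
        dispatch_end += dispatch
        compute_end = max(compute_end, dispatch_end) + compute
        combine_end = max(combine_end, compute_end) + combine
    return combine_end

# Hypothetical durations: 4 waves, each (dispatch=2, compute=3, combine=2).
waves = [(2, 3, 2)] * 4
print(serial_time(waves))     # 28
print(pipelined_time(waves))  # 16
```

In this model the pipelined total approaches the cost of the slowest stage times the number of waves, plus the fill and drain time of the other stages. Real hardware shares resources such as NVLink bandwidth between dispatch and combine, so actual gains would be smaller than this idealized figure.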