Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

Triton MoE kernel delivers high performance on both AMD and NVIDIA

By the PulseAugur editorial team · [1 source] · 2026-05-27 12:58

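A newly implemented fused Mixture-of-Experts (MoE) dispatch kernel, written entirely in Triton, reaches 89-131% of the performance of Stanford's Megablocks library. Notably, the kernel runs on AMD MI300X hardware with no code changes. The main optimization is fusing the gate and projection operations: intermediate results stay in registers, which significantly reduces global memory traffic.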
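The fusion idea can be illustrated with a minimal CPU sketch in plain Python. It is not the author's Triton kernel. It assumes that "gate and projection" refers to the gate and up projections of a SwiGLU-style expert FFN, which the summary does not confirm, and every name in it (`unfused_gate_up`, `fused_gate_up`, the toy shapes) is hypothetical. The unfused version materializes two intermediate buffers, standing in for global memory writes, while the fused version keeps both accumulators in local variables, standing in for registers.

```python
import math


def silu(v):
    # SiLU activation; assumed here, the summary does not name the activation.
    return v / (1.0 + math.exp(-v))


def matmul(a, b):
    # Naive matrix multiply: a is (T x D), b is (D x F).
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for t in range(rows):
        for f in range(cols):
            acc = 0.0
            for d in range(inner):
                acc += a[t][d] * b[d][f]
            out[t][f] = acc
    return out


def unfused_gate_up(x, w_gate, w_up):
    # Two separate passes: both intermediates are fully written out
    # before the elementwise combine reads them back.
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    out = [[silu(g) * u for g, u in zip(g_row, u_row)]
           for g_row, u_row in zip(gate, up)]
    intermediates_stored = 2 * len(x) * len(w_gate[0])
    return out, intermediates_stored


def fused_gate_up(x, w_gate, w_up):
    # One pass: each input value is read once and feeds both accumulators,
    # which live in local variables until the final result is written.
    rows, inner, cols = len(x), len(w_gate), len(w_gate[0])
    out = [[0.0] * cols for _ in range(rows)]
    for t in range(rows):
        for f in range(cols):
            g_acc = 0.0
            u_acc = 0.0
            for d in range(inner):
                xv = x[t][d]
                g_acc += xv * w_gate[d][f]
                u_acc += xv * w_up[d][f]
            out[t][f] = silu(g_acc) * u_acc
    return out, 0  # no intermediate buffer is materialized


# Toy data: 2 tokens, hidden size 3, FFN size 2 (hypothetical shapes).
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w_gate = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
w_up = [[0.7, 0.8], [-0.9, 1.0], [0.2, -0.3]]

ref, ref_stored = unfused_gate_up(x, w_gate, w_up)
fus, fus_stored = fused_gate_up(x, w_gate, w_up)
print("intermediate values stored:", ref_stored, "vs", fus_stored)
```

Both functions return the same values; only the number of intermediate values written out differs. In a real GPU kernel the accumulators would be tiles rather than scalars, and the saving would show up as fewer global memory round trips rather than fewer Python lists.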

Impact: Enables more efficient MoE model inference and may improve performance across diverse hardware, including AMD GPUs.

Ranking rationale: This cluster describes a new kernel implementation and benchmark results for a specific AI model architecture, which places it in the research category.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

r/LocalLLaMA · /u/bassrehab · 2026-05-27 12:58

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

https://www.reddit.com/r/LocalLLaMA/comments/1tp4u0u/fused_moe_dispatch_kernel_in_pure_triton_89131_of/

