Triton MoE kernel achieves high performance on AMD, NVIDIA

By PulseAugur Editorial · [1 source] · 2026-05-27 12:58

A new fused Mixture-of-Experts (MoE) dispatch kernel, written entirely in Triton, achieves 89-131% of the performance of Stanford's Megablocks library. The same kernel also runs on AMD MI300X hardware without any code modifications. The primary optimization fuses the gate and projection operations, which reduces global memory traffic by keeping intermediate results in registers instead of writing them out and reading them back.
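To make the memory-traffic argument concrete, here is a rough back-of-envelope sketch, not the author's code. It assumes the fused operations are the gate and up projections of a SwiGLU-style expert MLP (the summary says only "gate and projection operations"), and the function name, shapes, and 2-byte element size are all hypothetical. It counts only the intermediate tensors, since inputs and weights are read in both variants.

```python
def intermediate_traffic_bytes(tokens: int, ffn_dim: int,
                               bytes_per_elem: int = 2,
                               fused: bool = False) -> int:
    """Estimate global-memory bytes moved for intermediates only.

    Hypothetical model: one expert computes act(x @ W_gate) * (x @ W_up).
    Reads of x and the weights are excluded because both variants need them.
    """
    # Size of one (tokens x ffn_dim) intermediate tensor in bytes.
    tile = tokens * ffn_dim * bytes_per_elem
    if fused:
        # Gate and up results stay in registers; only the product is written.
        return tile
    # Unfused: write gate, write up, read gate, read up, write the product.
    return 5 * tile


if __name__ == "__main__":
    # Illustrative shapes only; not taken from the benchmark.
    t, f = 4096, 14336
    unfused = intermediate_traffic_bytes(t, f)
    fused = intermediate_traffic_bytes(t, f, fused=True)
    print(f"unfused: {unfused} B, fused: {fused} B, ratio: {unfused / fused:.1f}x")
```

Under these assumptions the unfused path moves five intermediate tensors through global memory where the fused path moves one. Real speedups depend on how much of the kernel is bandwidth-bound, so this ratio is an upper bound on the intermediate traffic saved, not a predicted speedup.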

IMPACT Enables more efficient MoE model inference, potentially improving performance on diverse hardware including AMD GPUs.

RANK_REASON The cluster describes a new kernel implementation and benchmark results for a specific AI model architecture, which falls under research.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/bassrehab · 2026-05-27 12:58

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

https://www.reddit.com/r/LocalLLaMA/comments/1tp4u0u/fused_moe_dispatch_kernel_in_pure_triton_89131_of/

