Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]
Researchers have developed TritonMoE, an inference kernel for Mixture-of-Experts (MoE) models written entirely in OpenAI's Triton language. The kernel runs on both NVIDIA and AMD hardware without vendor-specific code. It outperforms existing implementations such as Megablocks in throughput on shorter token sequences, though it has limitations with very long contexts or a large number of experts.
IMPACT: Enables more efficient and portable inference for Mixture-of-Experts models across different hardware architectures.