MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs
Researchers have developed MPK, a compiler and runtime system that speeds up multi-GPU model inference by fusing a model's tensor operations into a single high-performance mega-kernel. MPK represents the program as a graph of SM-level tasks, which enables optimizations such as cross-operator software pipelining and fine-grained overlap of computation with communication. In the authors' evaluations, MPK reduces end-to-end inference latency by up to 1.7x and brings LLM inference performance closer to hardware limits.
IMPACT: Improves LLM inference performance, potentially reducing latency and raising hardware utilization for those operating AI services.
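Why SM-level dependencies can help is easiest to see in a toy model. The sketch below is an illustration under stated assumptions and is not MPK's scheduler or API. `Task`, `simulate`, and `build_graph` are hypothetical names. Each operator is split into tiles, and each tile runs on one streaming multiprocessor (SM). The sketch compares two setups. In the first, every tile waits for the whole previous operator to finish, which mimics a kernel boundary. In the second, each tile waits only for the tile it reads from, which assumes an elementwise-style chain. Tile costs are made deliberately uneven, and kernel launch overhead is ignored.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Task:
    """One SM-sized unit of work, e.g. one output tile of one operator."""
    name: str
    duration: float
    deps: list = field(default_factory=list)  # names of tasks that must finish first


def simulate(tasks, num_sms):
    """Greedy list scheduling on `num_sms` persistent workers.

    A task becomes ready when its last dependency finishes; ready tasks are
    dispatched in ready-time order to whichever SM frees up first.
    Returns the makespan (time at which the last task finishes).
    """
    children = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            children[d].append(t.name)
    by_name = {t.name: t for t in tasks}
    pending = {t.name: len(t.deps) for t in tasks}  # unfinished deps per task
    ready_at = {t.name: 0.0 for t in tasks}         # latest finish among deps
    ready = [(0.0, t.name) for t in tasks if not t.deps]
    heapq.heapify(ready)
    sm_free = [0.0] * num_sms  # time each SM becomes idle (valid min-heap)
    makespan = 0.0
    while ready:
        r, name = heapq.heappop(ready)
        start = max(r, heapq.heappop(sm_free))  # earliest-free SM
        finish = start + by_name[name].duration
        heapq.heappush(sm_free, finish)
        makespan = max(makespan, finish)
        for c in children[name]:
            ready_at[c] = max(ready_at[c], finish)
            pending[c] -= 1
            if pending[c] == 0:
                heapq.heappush(ready, (ready_at[c], c))
    return makespan


def build_graph(num_ops, num_tiles, tile_level):
    """Chain of `num_ops` operators, each split into `num_tiles` tiles.

    tile_level=False: every tile of op k waits for all tiles of op k-1
                      (a barrier, as between separate kernel launches).
    tile_level=True:  tile i of op k waits only for tile i of op k-1
                      (assumes an elementwise-style data dependency).
    """
    tasks = []
    for k in range(num_ops):
        for i in range(num_tiles):
            dur = 1 + (i + k) % num_tiles  # deliberately uneven tile costs
            if k == 0:
                deps = []
            elif tile_level:
                deps = [f"op{k-1}.t{i}"]
            else:
                deps = [f"op{k-1}.t{j}" for j in range(num_tiles)]
            tasks.append(Task(f"op{k}.t{i}", dur, deps))
    return tasks


barrier = simulate(build_graph(3, 4, tile_level=False), num_sms=4)
fine = simulate(build_graph(3, 4, tile_level=True), num_sms=4)
print(barrier, fine)  # 12.0 9.0
```

With operator-level barriers, every SM idles until the slowest tile of each operator finishes. With tile-level dependencies, downstream tiles start as soon as their own inputs are ready, so the makespan falls to the longest per-tile chain. This toy model leaves out the launch overhead and communication overlap that the summary also credits, so it should not be read as a prediction of MPK's measured speedups.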