PulseAugur

MPK compiler fuses multi-GPU inference into single mega-kernel

Researchers have developed MPK (Mirage Persistent Kernel), a compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. The system uses a graph representation at the level of individual GPU streaming multiprocessors (SMs), which enables optimizations such as cross-operator software pipelining and fine-grained overlap of computation and communication. In the authors' evaluations, MPK reduces end-to-end inference latency by up to 1.7x, pushing LLM inference performance closer to hardware limits.
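
To make the idea of fine-grained dependencies concrete, here is a toy Python sketch. It is not MPK's scheduler or API: the task names, durations, and the greedy `makespan` helper are all hypothetical, chosen only to show why replacing a kernel-level barrier with per-tile dependencies can shorten a schedule.

```python
# Toy model (assumption, not MPK's implementation): each "worker" stands in
# for an SM, and each task is one tile of an operator.

def makespan(durations, deps, num_workers):
    """Greedy list scheduling; returns the time at which the last task ends.

    durations: dict mapping task name -> run time
    deps:      dict mapping task name -> list of prerequisite task names
    """
    finish = {}                      # task -> finish time
    workers = [0.0] * num_workers    # time at which each worker becomes free
    remaining = dict(durations)

    def earliest_start(task):
        # A task may start once all of its prerequisites have finished.
        return max((finish[d] for d in deps.get(task, [])), default=0.0)

    while remaining:
        # Tasks whose prerequisites have all been scheduled already.
        ready = [t for t in remaining
                 if all(d in finish for d in deps.get(t, []))]
        # Run the ready task that can start soonest on the worker that
        # frees up first.
        task = min(ready, key=lambda t: (earliest_start(t), t))
        w = min(range(num_workers), key=lambda i: workers[i])
        start = max(workers[w], earliest_start(task))
        finish[task] = start + remaining.pop(task)
        workers[w] = finish[task]
    return max(finish.values())


# Two operators, A then B, each split into two tiles with uneven costs.
durations = {"A0": 3, "A1": 1, "B0": 1, "B1": 3}

# Kernel-per-operator: every B tile waits for ALL of A (an implicit barrier).
barrier_deps = {"B0": ["A0", "A1"], "B1": ["A0", "A1"]}

# Fine-grained: each B tile waits only for the A tile that produces its input.
tile_deps = {"B0": ["A0"], "B1": ["A1"]}

barrier_makespan = makespan(durations, barrier_deps, num_workers=2)
tile_makespan = makespan(durations, tile_deps, num_workers=2)
print(barrier_makespan, tile_makespan)
```

In this made-up case the barrier schedule takes 6 time units and the per-tile schedule takes 4, because tile B1 can start as soon as A1 finishes. The numbers are illustrative only and say nothing about MPK's reported 1.7x result.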

IMPACT Optimizes LLM inference performance, potentially reducing latency and improving hardware utilization for AI operators.

RANK_REASON The cluster contains an academic paper detailing a new compiler and runtime system for optimizing tensor programs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao… ·

    MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

    arXiv:2512.22219v2 Announce Type: replace-cross Abstract: We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph repres…