LLMs and new frameworks boost GPU kernel optimization

By PulseAugur Editorial · [15 sources] · 2026-05-22 00:00

Researchers are exploring new ways to optimize GPU kernels for large language models. One approach, GPU Forecasters, uses language models as surrogates that predict kernel performance, so a search can consider many more candidates within a limited measurement budget. Another, STOF, accelerates sparse Transformers by optimizing multi-head attention and fusing downstream operators. KLineage learns optimization skills from expert kernels to guide LLMs on when an optimization is sound, while Xe-Forge automates kernel optimization for Intel GPUs with a multi-stage pipeline. Finally, FastKernels addresses the gap between benchmark performance and real-world deployment with a representative set of architectures and a production-grade inference framework.
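The surrogate idea can be illustrated with a minimal sketch: rank every candidate with a cheap predictor, then spend the scarce hardware measurements only on the top of that ranking. This is not the method from the GPU Forecasters paper; the function name `surrogate_guided_search`, the tile-size candidates, and both toy runtime models are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Sequence, Tuple


def surrogate_guided_search(
    candidates: Sequence[int],
    predict: Callable[[int], float],
    measure: Callable[[int], float],
    budget: int,
) -> Tuple[int, float, int]:
    """Measure only the `budget` candidates the surrogate ranks fastest.

    Returns (best candidate, its measured runtime, number of measurements).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    # Cheap step: rank every candidate by predicted runtime.
    shortlist = sorted(candidates, key=predict)[:budget]
    # Expensive step: spend the hardware budget on the shortlist only.
    measured = [(measure(c), c) for c in shortlist]
    best_runtime, best = min(measured)
    return best, best_runtime, len(measured)


# Hypothetical toy models, not taken from any of the papers.
def toy_measure(tile: int) -> float:
    """Stand-in for a hardware measurement (ms); fastest at tile size 64."""
    return 1.0 + abs(tile - 64) / 64


def toy_predict(tile: int) -> float:
    """Imperfect surrogate (ms); it wrongly believes 48 would be optimal."""
    return 1.0 + abs(tile - 48) / 64


TILES = [8, 16, 32, 64, 128, 256]
best, runtime_ms, n_measured = surrogate_guided_search(
    TILES, toy_predict, toy_measure, budget=2
)
print(best, runtime_ms, n_measured)
```

In this toy setup the surrogate is slightly wrong, yet two measurements are enough to find the tile size that all six measurements would have found. With a budget of one, the same surrogate picks a slower candidate, which is why a prediction is only useful as a filter in front of real measurements.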

IMPACT New LLM-based techniques and benchmarks promise to accelerate GPU kernel optimization, potentially leading to faster AI model inference and deployment.

RANK_REASON Multiple research papers introducing new methods and benchmarks for GPU kernel optimization using LLMs and other techniques.


paper
infra

AI-generated summary · Google Gemini · from 15 sources.

COVERAGE [15]

arXiv cs.AI TIER_1 English(EN) · Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal · 2026-06-01 04:00

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

arXiv:2605.31464v1 Announce Type: cross Abstract: GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth sign…
arXiv cs.AI TIER_1 English(EN) · Mohit Bansal · 2026-05-29 15:56

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, b…
arXiv cs.LG TIER_1 English(EN) · Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun · 2026-05-29 04:00

Accelerating Sparse Transformer Inference on GPU

arXiv:2506.06095v5 Announce Type: replace Abstract: Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topi…
arXiv cs.AI TIER_1 English(EN) · Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao · 2026-05-28 04:00

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

arXiv:2605.28213v1 Announce Type: new Abstract: LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from…
arXiv cs.AI TIER_1 English(EN) · Marcin Spoczynski, Daniel Fleischer, Moshe Berchansky, Gabriela Ben-Melech Stan, Shira Guskin, Weilin Xu, Adam Siemieniuk, Alexander Heinecke · 2026-05-27 04:00

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

arXiv:2605.26118v1 Announce Type: cross Abstract: Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarou…
arXiv cs.AI TIER_1 Deutsch(DE) · Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari · 2026-05-25 04:00

FastKernels: Benchmarking GPU Kernel Generation in Production

arXiv:2605.23215v1 Announce Type: cross Abstract: LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks…
Hugging Face Daily Papers TIER_1 Deutsch(DE) · 2026-05-22 04:19

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synth…
arXiv cs.CL TIER_1 Deutsch(DE) · Samyam Rajbhandari · 2026-05-22 04:19

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synth…
Hugging Face Daily Papers TIER_1 Deutsch(DE) · 2026-05-22 00:00

FastKernels: Benchmarking GPU Kernel Generation in Production

FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment.
MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-29 08:43

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

UC Berkeley's UCCL team releases mKernel, fusing intra-node NVLink, inter-node RDMA, and dense compute into a single persistent CUDA kernel.
Medium — MLOps tag TIER_1 English(EN) · Parv Agarwal · 2026-05-27 08:51

The Hidden Problem With Long-Running GPU Training Workflows

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-29 08:54

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication c

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication can consume over 40% of AI training time; this approach aims to eliminate that bottleneck. https://www.marktechpost.com…

LINKS marktechpost.com/…/meet-mkernel-a-multi-g…
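
To put the 40% figure in perspective, a back-of-the-envelope bound helps: if a fraction f of step time is communication and all of it could be hidden behind compute, the best possible speedup is 1 / (1 - f). The sketch below is plain Amdahl-style arithmetic, not a measurement of mKernel; the function name and the example fraction of exactly 0.4 are assumptions for illustration.

```python
def max_speedup_if_hidden(comm_fraction: float) -> float:
    """Upper bound on speedup when communication time is fully hidden.

    comm_fraction is the share of total step time spent on communication.
    """
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    # Only the remaining (1 - f) share of the time still has to be paid.
    return 1.0 / (1.0 - comm_fraction)


# Hypothetical example: communication takes exactly 40% of a training step.
print(f"{max_speedup_if_hidden(0.4):.2f}x")
```

Real gains would be smaller whenever overlap is partial or the fused kernel slows down the compute it shares the GPU with.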
r/LocalLLaMA TIER_1 Deutsch(DE) · /u/comperr · 2026-05-28 04:13

Heterogeneous GPU Weighting & Layer Splitting

This is what I worked on today. With local LLM of course. So if I didn't write the code, did I really work on it? Who cares. It was my idea and I simply asked it to implement it. I basically downloaded /main/ branch, which is totally broken for W…
r/MachineLearning TIER_1 English(EN) · /u/traceml-ai · 2026-05-27 11:24

Profiling PyTorch training without accidentally stalling the GPU [D]

Profiling PyTorch training has an interesting measurement problem: the more you measure, the more you can change the behavior of the run itself. A simple example is torch.cuda.synchronize(). It gives cleaner timing boundaries,…
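
Why a synchronization point can distort a run is easier to see in a toy timeline. GPU work is normally launched asynchronously, so the host can prepare the next step while the device is still busy; a blocking synchronize after every step removes that overlap. The sketch below is a simplified pure-Python model, not PyTorch code: `simulate`, the single-queue assumption, and the millisecond costs are all hypothetical.

```python
def simulate(steps: int, cpu_ms: float, gpu_ms: float, sync_every_step: bool) -> float:
    """Return total wall time (ms) for a toy host/GPU pipeline.

    Each step costs cpu_ms on the host (prepare and enqueue) and gpu_ms on
    the device. The device runs queued work in order, one step at a time.
    """
    host_free = 0.0  # time at which the host can start its next step
    gpu_free = 0.0   # time at which the GPU has drained its queue
    for _ in range(steps):
        host_free += cpu_ms                         # host enqueues the step
        gpu_free = max(gpu_free, host_free) + gpu_ms  # GPU runs it when idle
        if sync_every_step:
            host_free = gpu_free                    # host blocks until done
    return max(host_free, gpu_free)


# Hypothetical costs: 2 ms of host work and 3 ms of GPU work per step.
overlapped = simulate(steps=4, cpu_ms=2.0, gpu_ms=3.0, sync_every_step=False)
serialized = simulate(steps=4, cpu_ms=2.0, gpu_ms=3.0, sync_every_step=True)
print(overlapped, serialized)
```

In this model the instrumented run takes 20 ms instead of 14 ms, so the profiler would report clean per-step boundaries for a run that no longer behaves like the original one.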
