FastKernels: Benchmarking GPU Kernel Generation in Production

Large Language Models and New Frameworks Advance GPU Kernel Optimization

By the PulseAugur editorial team · [15 sources] · 2026-05-22 00:00

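Researchers are exploring new ways to optimize GPU kernel performance for large language models. One approach uses a language model as a proxy to predict kernel performance, substantially increasing the number of candidates that can be considered within a limited budget. Another, STOF, accelerates sparse Transformers by optimizing multi-head attention and fusing downstream operators. In addition, a new framework called KLineage learns optimization skills from expert kernels to guide large language models, while Xe-Forge automates kernel optimization for Intel GPUs with a multi-stage pipeline. Finally, FastKernels addresses the gap between benchmark performance and real-world deployment by building a production-aligned benchmark and inference framework.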

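Impact: New LLM-based techniques and benchmarks are expected to accelerate GPU kernel optimization, which could in turn speed up AI model inference and deployment.

Ranking rationale: Multiple research papers introduce new methods and benchmarks for GPU kernel optimization using large language models and other techniques.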

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 15 sources.

Sources [15]

arXiv cs.AI TIER_1 English(EN) · Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal · 2026-06-01 04:00

GPU Predictors: Language Models as Selective Proxies for Kernel Runtime Optimization

arXiv:2605.31464v1 (announce type: cross). Abstract: GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth sign…

arXiv cs.AI TIER_1 English(EN) · Mohit Bansal · 2026-05-29 15:56

GPU Predictors: Language Models as Selective Proxies for Kernel Runtime Optimization

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, b…

arXiv cs.LG TIER_1 English(EN) · Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun · 2026-05-29 04:00

Accelerating Sparse Transformer Inference on GPU

arXiv:2506.06095v5 (announce type: replace). Abstract: Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topi…

arXiv cs.AI TIER_1 English(EN) · Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao · 2026-05-28 04:00

Learning When to Optimize: Verified Optimization Skills from Expert GPU Kernel Lineages

arXiv:2605.28213v1 (announce type: new). Abstract: LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from…

arXiv cs.AI TIER_1 English(EN) · Marcin Spoczynski, Daniel Fleischer, Moshe Berchansky, Gabriela Ben-Melech Stan, Shira Guskin, Weilin Xu, Adam Siemieniuk, Alexander Heinecke · 2026-05-27 04:00

Xe-Forge: Multi-Stage LLM-Driven Kernel Optimization for Intel GPUs

arXiv:2605.26118v1 (announce type: cross). Abstract: Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarou…

arXiv cs.AI TIER_1 English(EN) · Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari · 2026-05-25 04:00

FastKernels: Benchmarking GPU Kernel Generation in Production

arXiv:2605.23215v1 (announce type: cross). Abstract: LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-22 04:19

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synth…

arXiv cs.CL TIER_1 English(EN) · Samyam Rajbhandari · 2026-05-22 04:19

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synth…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-22 00:00

FastKernels: Benchmarking GPU Kernel Generation in Production

FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment.

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-29 08:43

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

UC Berkeley's UCCL team releases mKernel, fusing intra-node NVLink, inter-node RDMA, and dense compute into a single persistent CUDA kernel.

Medium - MLOps tag TIER_1 English(EN) · Parv Agarwal · 2026-05-27 08:51

The Hidden Problem With Long-Running GPU Training Workflows

Link: https://medium.com/@agarwalparv/the-hidden-problem-with-long-running-gpu-training-workflows-2b3b99488217?source=rss------mlops-5

Mastodon - fosstodon.org TIER_1 English(EN) · 2026-05-29 08:54

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication c…

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication can consume over 40% of AI training time - this approach aims to eliminate that bottleneck. https://www.marktechpost.com…

Link: marktechpost.com/…/meet-mkernel-a-multi-g…

Mastodon - fosstodon.org TIER_1 English(EN) · 2026-05-29 08:54

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication c…

UC Berkeley's UCCL team releases mKernel, a fused CUDA kernel library that merges GPU communication and compute into a single persistent kernel. Communication can consume over 40% of AI training time - this approach aims to eliminate that bottleneck. https://www.marktechpost.com…

Link: marktechpost.com/…/meet-mkernel-a-multi-g…

r/LocalLLaMA TIER_1 English(EN) · /u/comperr · 2026-05-28 04:13

Heterogeneous GPU Weighting and Layer Splitting

This is what I worked on today. With local LLM of course. So if I didn't write the code, did I really work on it? Who cares. It was my idea and I simply asked it to implement it. I basically downloaded /main/ branch, which is totally broken for W…

r/MachineLearning TIER_1 English(EN) · /u/traceml-ai · 2026-05-27 11:24

Profiling PyTorch Training Without Accidentally Stalling the GPU [D]

Profiling PyTorch training has an interesting measurement problem: the more you measure, the more you can change the behavior of the run itself. A simple example is torch.cuda.synchronize(). It gives cleaner timing boundaries,…
