LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart.

Researchers find that large language models struggle to generate multi-GPU kernels

By the PulseAugur editorial team · [4 sources] · 2026-06-23 20:17

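Researchers at Together found that large language models can generate single-GPU kernels efficiently but face major difficulty with multi-GPU kernels. Asked to write kernels optimized for multiple GPUs, the models perform poorly, often failing to compile or producing incorrect results. The limitation stems from a difference in bottlenecks: single-GPU kernels are bound by compute and memory bandwidth, while multi-GPU kernels are bound by the interconnect, and current models do not handle that difference well.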
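To make the bottleneck contrast concrete, here is a rough back-of-the-envelope sketch in Python. It does not come from the Together posts: the function name `estimate_bottleneck`, the no-overlap cost model, and every hardware and workload number are illustrative assumptions chosen as round figures, not measurements of any real GPU or link.

```python
from typing import Dict, Tuple


def estimate_bottleneck(
    flops: float,
    mem_bytes: float,
    comm_bytes: float,
    peak_flops: float,
    mem_bw: float,
    link_bw: float,
) -> Tuple[str, Dict[str, float]]:
    """Return the slowest resource and the per-resource time in seconds.

    Simplifying assumption: each resource is used in isolation with no
    overlap, so the largest of the three times is the bottleneck.
    """
    times = {
        "compute": flops / peak_flops,
        "memory": mem_bytes / mem_bw,
        "interconnect": comm_bytes / link_bw,
    }
    return max(times, key=times.get), times


# Hypothetical hardware figures (round numbers, not a real device).
PEAK_FLOPS = 1e15  # FLOP/s
MEM_BW = 3e12      # bytes/s of on-device memory bandwidth
LINK_BW = 4e11     # bytes/s of GPU-to-GPU interconnect bandwidth

# Hypothetical single-GPU kernel: no cross-GPU traffic at all.
single_name, single_times = estimate_bottleneck(
    flops=2e12, mem_bytes=1.2e10, comm_bytes=0.0,
    peak_flops=PEAK_FLOPS, mem_bw=MEM_BW, link_bw=LINK_BW,
)

# Hypothetical per-GPU shard of a multi-GPU kernel: less local work,
# but 1 GB must cross the interconnect.
multi_name, multi_times = estimate_bottleneck(
    flops=2.5e11, mem_bytes=1.5e9, comm_bytes=1e9,
    peak_flops=PEAK_FLOPS, mem_bw=MEM_BW, link_bw=LINK_BW,
)

print(single_name, single_times)
print(multi_name, multi_times)
```

Under these assumed numbers the single-GPU case is limited by memory bandwidth and the multi-GPU case by the interconnect. Real kernels overlap communication with compute, so treat this only as an intuition aid.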

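Impact: This highlights the current limits of large language models on complex parallel programming tasks and may affect the development of AI infrastructure.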

Ranking rationale: A research finding on how well large language models can generate multi-GPU kernels.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-23 20:17

An agentic loop (compile, test, profile, revise) helps. Gemini 3 Pro went from 24 to 35/87 correct, then plateaued after ~20 steps. Feedback fixes syntax, not rank coordination, collective ordering, or transfer-mechanism choice. TMA and NVLS stay almost unused. https://t.co/VKER…
X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-23 20:17

Frontier models struggle.
→ Best zero-shot: 28/87 correct, 22 beat the PyTorch + NCCL baseline
→ With 3 attempts: 36/87 correct, but fast1@3 tops out at 31%
Weak models fail to compile. Strong reasoners compile cleanly and return wrong answers. https://t.co/1fgoyZRmuH
X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-23 20:17

But why are multi-GPU kernels so different? Single-GPU kernels are bottlenecked by compute and memory bandwidth. Multi-GPU kernels by the interconnect. Each PKB task hands the model a PyTorch + NCCL reference and asks it to communicate directly across GPUs via symmetric memory. …
X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-23 20:17

LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart. ParallelKernelBench measures how they fail by benchmarking against 87 problems pulled from real codebases including Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, NeMo-RL. New research from Willy ht…
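The counts quoted in the posts above are easier to compare as rates over the 87 problems. The short Python sketch below does that arithmetic and also shows one plausible reading of a fast_p@k style metric. The posts do not define fast1@3, so the `fast_p_at_k` helper, its definition (a problem counts if any of its first k attempts is both correct and faster than the baseline by more than a factor p), and the toy attempt data are assumptions for illustration only.

```python
from typing import List, Tuple

TOTAL_PROBLEMS = 87  # benchmark size quoted in the posts


def pass_rate(count: int, total: int = TOTAL_PROBLEMS) -> float:
    """Convert a raw count of problems into a percentage of the benchmark."""
    return 100.0 * count / total


# Counts taken directly from the posts.
quoted = {
    "zero-shot correct": 28,
    "zero-shot faster than baseline": 22,
    "correct within 3 attempts": 36,
    "agentic loop, start": 24,
    "agentic loop, plateau": 35,
}
for label, count in quoted.items():
    print(f"{label}: {count}/{TOTAL_PROBLEMS} = {pass_rate(count):.1f}%")

Attempt = Tuple[bool, float]  # (is_correct, speedup over baseline)


def fast_p_at_k(problems: List[List[Attempt]], p: float = 1.0, k: int = 3) -> float:
    """Assumed metric: share of problems with at least one of the first k
    attempts that is correct AND has speedup greater than p."""
    hits = sum(
        any(correct and speedup > p for correct, speedup in attempts[:k])
        for attempts in problems
    )
    return hits / len(problems)


# Toy data (hypothetical, not benchmark results).
toy_attempts = [
    [(False, 0.0), (True, 1.4), (True, 0.9)],     # correct and faster once
    [(True, 0.8), (True, 0.95), (False, 0.0)],    # correct but never faster
    [(False, 0.0), (False, 0.0), (False, 0.0)],   # never correct
    [(True, 2.0)],                                # correct and faster
]
print(fast_p_at_k(toy_attempts, p=1.0, k=3))
```

On this reading, a model can raise its correct count with more attempts while the speed-qualified rate lags behind, which matches the gap the post describes between 36/87 correct and a fast1@3 of 31%.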

