PulseAugur
English(EN) LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart.

Researchers find that large language models struggle to generate multi-GPU kernels

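Researchers at Together found that although large language models can write fast single-GPU kernels, they struggle badly with multi-GPU kernels. When asked to produce kernels optimized for multiple GPUs, the models perform poorly, often failing to compile or returning incorrect results. The gap stems from a difference in bottlenecks: single-GPU kernels are limited by compute and memory bandwidth, while multi-GPU kernels are limited by the interconnect, and current models do not handle that difference well.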

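Impact: Highlights the current limits of large language models on complex parallel programming tasks, which may affect AI infrastructure development.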

Ranking rationale: A research finding on the ability of large language models to generate multi-GPU kernels.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    An agentic loop (compile, test, profile, revise) helps. Gemini 3 Pro went from 24 to 35/87 correct, then plateaued after ~20 steps. Feedback fixes syntax, not rank coordination, collective ordering, or transfer-mechanism choice. TMA and NVLS stay almost unused. https://t.co/VKER…

  2. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    Frontier models struggle.
    → Best zero-shot: 28/87 correct, 22 beat the PyTorch + NCCL baseline
    → With 3 attempts: 36/87 correct, but fast1@3 tops out at 31%
    Weak models fail to compile. Strong reasoners compile cleanly and return wrong answers. https://t.co/1fgoyZRmuH

  3. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    But why are multi-GPU kernels so different? Single-GPU kernels are bottlenecked by compute and memory bandwidth. Multi-GPU kernels by the interconnect. Each PKB task hands the model a PyTorch + NCCL reference and asks it to communicate directly across GPUs via symmetric memory. …

  4. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart. ParallelKernelBench measures how they fail by benchmarking against 87 problems pulled from real codebases including Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, NeMo-RL. New research from Willy ht…
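
The excerpts above quote a "fast1@3" score without defining it. As a rough illustration only, the sketch below assumes the metric follows the common fast_p@k convention: the fraction of problems for which at least one of the first k attempts is both correct and more than p times faster than the baseline. The `Attempt` type, the `fast_p_at_k` function, and the toy data are all hypothetical and are not taken from ParallelKernelBench.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Attempt:
    """One generated kernel for one problem (hypothetical record layout)."""
    correct: bool   # output matched the reference implementation
    speedup: float  # baseline_time / kernel_time; only meaningful when correct


def fast_p_at_k(results: Sequence[List[Attempt]], p: float = 1.0, k: int = 3) -> float:
    """Assumed definition: share of problems where any of the first k
    attempts is correct AND has speedup strictly greater than p."""
    if not results:
        return 0.0
    solved = sum(
        any(a.correct and a.speedup > p for a in attempts[:k])
        for attempts in results
    )
    return solved / len(results)


# Toy data: four problems, up to three attempts each (made-up numbers).
toy = [
    [Attempt(False, 0.0), Attempt(True, 1.4)],                      # fast on 2nd try
    [Attempt(True, 0.8), Attempt(True, 0.9), Attempt(True, 0.95)],  # correct but slower
    [Attempt(False, 0.0), Attempt(False, 0.0), Attempt(False, 0.0)],  # never correct
    [Attempt(True, 2.0)],                                           # fast on 1st try
]

if __name__ == "__main__":
    print(fast_p_at_k(toy, p=1.0, k=1))  # only first attempts count
    print(fast_p_at_k(toy, p=1.0, k=3))  # best of three attempts
    print(fast_p_at_k(toy, p=0.0, k=3))  # p=0 reduces to plain correctness
```

Under this assumed definition, setting p to 0 gives a correctness-only rate, which is one way to read why a "correct" count and a fast1 score can diverge: a kernel can return the right answer and still be slower than the PyTorch + NCCL baseline.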