PulseAugur

LLMs struggle to generate multi-GPU kernels, researchers find

Researchers at Together report that large language models, which can write fast single-GPU kernels, perform far worse when asked for multi-GPU kernels. On ParallelKernelBench, a set of 87 problems drawn from real codebases, the best zero-shot result was 28 correct kernels: weaker models often fail to compile, while stronger reasoning models compile cleanly but return wrong answers. The researchers attribute the gap to a shift in bottleneck. Single-GPU kernels are limited by compute and memory bandwidth, whereas multi-GPU kernels are limited by the interconnect, which current LLMs do not handle effectively.

IMPACT Highlights a current limitation in LLM capabilities for complex parallel programming tasks, potentially impacting AI infrastructure development.

RANK_REASON Research findings on LLM capabilities in generating multi-GPU kernels.


AI-generated summary · Google Gemini · from 4 sources.

COVERAGE [4]

  1. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    An agentic loop (compile, test, profile, revise) helps. Gemini 3 Pro went from 24 to 35/87 correct, then plateaued after ~20 steps. Feedback fixes syntax, not rank coordination, collective ordering, or transfer-mechanism choice. TMA and NVLS stay almost unused. https://t.co/VKER…

  2. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    Frontier models struggle.
    → Best zero-shot: 28/87 correct, 22 beat the PyTorch + NCCL baseline
    → With 3 attempts: 36/87 correct, but fast1@3 tops out at 31%
    Weak models fail to compile. Strong reasoners compile cleanly and return wrong answers. https://t.co/1fgoyZRmuH

  3. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    But why are multi-GPU kernels so different? Single-GPU kernels are bottlenecked by compute and memory bandwidth. Multi-GPU kernels by the interconnect. Each PKB task hands the model a PyTorch + NCCL reference and asks it to communicate directly across GPUs via symmetric memory. …

  4. X — Together (inference / OSS) TIER_1 English(EN) · togethercompute ·

    LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart. ParallelKernelBench measures how they fail by benchmarking against 87 problems pulled from real codebases including Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, NeMo-RL. New research from Willy ht…
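
The posts above quote "correct" counts and a "fast1@3" score without defining them. The sketch below shows one plausible reading, which is an assumption and not something the source confirms: a problem counts toward correct@k if any of the first k attempts passes the correctness check, and toward fast_p@k if any of those attempts is both correct and more than p times faster than the PyTorch + NCCL baseline. The record layout, function names, and toy data are hypothetical and are not taken from ParallelKernelBench. Under this reading, a fast1@3 of 31% over 87 problems would correspond to roughly 27 problems.

```python
from typing import List, Tuple

# Hypothetical record layout: one (is_correct, speedup_over_baseline) pair per attempt.
Attempt = Tuple[bool, float]


def correct_at_k(results: List[List[Attempt]], k: int) -> float:
    """Fraction of problems with at least one correct kernel in the first k attempts."""
    solved = sum(any(ok for ok, _ in attempts[:k]) for attempts in results)
    return solved / len(results)


def fast_p_at_k(results: List[List[Attempt]], k: int, p: float = 1.0) -> float:
    """Fraction of problems where some attempt among the first k is correct
    AND has speedup strictly greater than p (assumed definition)."""
    hits = sum(
        any(ok and speedup > p for ok, speedup in attempts[:k])
        for attempts in results
    )
    return hits / len(results)


# Toy data for four made-up problems, three attempts each.
toy_results: List[List[Attempt]] = [
    [(False, 0.0), (True, 1.4), (True, 0.9)],    # correct and fast on attempt 2
    [(True, 0.8), (True, 0.7), (False, 0.0)],    # correct, never beats the baseline
    [(False, 0.0), (False, 0.0), (False, 0.0)],  # never correct
    [(True, 1.2), (False, 0.0), (True, 2.0)],    # correct and fast on attempt 1
]

if __name__ == "__main__":
    print("correct@1:", correct_at_k(toy_results, 1))
    print("correct@3:", correct_at_k(toy_results, 3))
    print("fast1@1:  ", fast_p_at_k(toy_results, 1))
    print("fast1@3:  ", fast_p_at_k(toy_results, 3))
```

Separating the two metrics makes the reported gap visible: extra attempts can raise the number of correct kernels without raising the number that also outrun the baseline, which is the pattern the second post describes.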