PulseAugur

Frontier LLMs struggle with multi-GPU kernel generation, new benchmark reveals

ParallelKernelBench (PKB) is a new benchmark that evaluates how well frontier large language models generate efficient multi-GPU kernels. Tests of models including GPT-5.5, Gemini 3 Pro, and Opus 4.7 revealed significant gaps: fewer than a third of the problems were solved correctly, and fewer than a quarter of those solutions outperformed a naive baseline. The benchmark focuses on replacing PyTorch + NCCL with direct CUDA kernels over NVLink, targeting the communication overhead that often bottlenecks AI inference.

IMPACT Highlights limitations in current LLMs for optimizing multi-GPU communication, a key bottleneck for large-scale AI inference.

RANK_REASON The item describes a new benchmark and evaluation framework for LLM-generated code, including performance results for frontier models.

Read on Together AI blog →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Together AI blog TIER_1 English (EN)

    ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

    ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves under a third, but a few generated kernels beat any public implementation.
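
To put the reported fractions in absolute terms, here is a minimal arithmetic sketch. It assumes that "under a third" applies to all 87 workloads, and that the summary's "fewer than a quarter of those" applies to the solved set; neither source states exact counts, so the results are upper bounds rather than reported figures. The helper `max_count_below` is a hypothetical name introduced for illustration.

```python
import math
from fractions import Fraction


def max_count_below(total: int, fraction: Fraction) -> int:
    """Largest whole count strictly less than total * fraction (never negative)."""
    bound = total * fraction
    return max(math.ceil(bound) - 1, 0)


# 87 workloads is the figure quoted from the Together AI blog blurb.
WORKLOADS = 87

# "Under a third" solved: 87 / 3 = 29, so at most 28 problems.
max_solved = max_count_below(WORKLOADS, Fraction(1, 3))

# "Fewer than a quarter of those" beat the naive baseline (assumed to
# refer to the solved set): 28 / 4 = 7, so at most 6 kernels.
max_faster = max_count_below(max_solved, Fraction(1, 4))

print(f"solved: at most {max_solved} of {WORKLOADS}")
print(f"faster than baseline: at most {max_faster} of {max_solved}")
```

Under these assumptions, the best model solves at most 28 of the 87 workloads, and at most 6 of those solutions run faster than the naive baseline.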