A new benchmark called ParallelKernelBench (PKB) has been developed to evaluate the ability of frontier large language models to generate efficient multi-GPU kernels. Testing models like GPT-5.5, Gemini 3 Pro, and Opus 4.7 revealed significant performance gaps, with fewer than a third of problems solved correctly and fewer than a quarter of those outperforming a naive baseline. The benchmark focuses on replacing PyTorch + NCCL with direct CUDA kernels over NVLink, addressing the critical communication overhead that often bottlenecks AI inference.
AI IMPACT Highlights limitations in current LLMs for optimizing multi-GPU communication, a key bottleneck for large-scale AI inference.
RANK_REASON The item describes a new benchmark and evaluation framework for LLM-generated code, including performance results for frontier models.
- CUDA
- Gemini 3 Pro
- GPT-5.5
- NCCL
- NVIDIA NeMo-RL
- NVLink
- Opus 4.7
- ParallelKernelBench
- PyTorch
- Together AI
AI-generated summary · Google Gemini · from 1 source.