New benchmark tests LLMs on real-world code optimization

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have introduced SWE-fficiency, a benchmark that evaluates how well language models can optimize the performance of real-world software repositories. It comprises 498 tasks across nine popular data-science, machine-learning, and HPC repositories, such as NumPy and Pandas. Each task challenges an agent to analyze the code, identify performance bottlenecks, and propose a patch that matches or exceeds the expert's speedup while still passing all unit tests. Initial evaluations show that current state-of-the-art agents fall well short, achieving less than 0.23x the expert speedup. The shortfall is attributed to difficulties with localizing bottlenecks, reasoning across functions, and maintaining code correctness.
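The summary does not spell out how the "0.23x the expert speedup" figure is computed. As a rough illustration only, the sketch below assumes that a patch's speedup is the baseline runtime divided by the patched runtime, that a model is scored by its speedup relative to the expert's, that a patch failing the unit tests counts as no speedup (1.0x), and that per-task ratios are combined with a harmonic mean. The function names and all runtimes are hypothetical and are not taken from the benchmark.

```python
from statistics import harmonic_mean


def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """How many times faster the workload runs after a patch is applied."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def speedup_ratio(baseline: float, model_patched: float,
                  expert_patched: float, tests_pass: bool) -> float:
    """Model speedup relative to expert speedup for one task."""
    # Assumption: a patch that breaks the unit tests is scored as 1.0x (no speedup).
    model_speedup = speedup(baseline, model_patched) if tests_pass else 1.0
    return model_speedup / speedup(baseline, expert_patched)


# Hypothetical tasks: (baseline s, model-patched s, expert-patched s, tests pass?)
tasks = [
    (10.0, 8.0, 2.0, True),   # model 1.25x vs. expert 5x
    (4.0, 1.0, 1.0, True),    # model matches the expert
    (6.0, 1.0, 3.0, False),   # fast but incorrect patch, scored as 1.0x
]

ratios = [speedup_ratio(*task) for task in tasks]
# Assumption: aggregate with a harmonic mean, which penalizes low ratios heavily.
overall = harmonic_mean(ratios)
print(ratios, round(overall, 3))
```

Under these assumptions, a score below 1.0 means the model's patches recover only a fraction of the improvement the expert achieved, and an incorrect patch gains nothing no matter how fast it runs.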

IMPACT This benchmark could accelerate research into LLMs capable of complex, long-horizon reasoning for software performance optimization.

RANK_REASON The cluster contains a research paper detailing a new benchmark for evaluating LLMs on software engineering tasks.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan · 2026-06-30 04:00

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

arXiv:2511.06090v3. Abstract: Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize …
