Researchers have introduced SWE-fficiency, a benchmark that evaluates how well language models optimize the performance of real-world software repositories. It comprises 498 tasks across nine popular data-science, machine-learning, and HPC repositories, including NumPy and pandas. Each task asks an agent to analyze the code, locate performance bottlenecks, and propose a patch that matches or exceeds the expert's speedup while passing all unit tests. In initial evaluations, current state-of-the-art agents fall well short, achieving less than 0.23x the expert speedup because they struggle with localization, cross-function reasoning, and maintaining code correctness.
AI IMPACT This benchmark could accelerate research into LLMs capable of complex, long-horizon reasoning for software performance optimization.
RANK_REASON The cluster contains a research paper detailing a new benchmark for evaluating LLMs on software engineering tasks.
AI-generated summary · Google Gemini · from 1 source.