DeepSWE is a new benchmark developed to assess the coding capabilities of frontier AI models more accurately. Unlike previous benchmarks, it is contamination-free: its tasks are created from scratch, so models cannot have seen the solutions during pretraining. It covers 91 repositories and five languages, and its tasks reflect real-world complexity, requiring longer solutions and more output tokens than existing benchmarks. The benchmark also uses hand-written verifiers that test software behavior, so that scores reflect actual performance on software engineering tasks.
AI IMPACT Provides a more realistic evaluation of AI coding agents, which could guide future model development and application.
RANK_REASON New benchmark paper released for evaluating AI models.
AI-generated summary · Google Gemini · from 1 source.