
SWE-bench creator John Yang discusses AI coding agent evaluation and future benchmarks

John Yang, creator of the SWE-bench coding evaluation benchmark, discussed the evolving landscape of AI code evaluation in a recent podcast. Initially overlooked, SWE-bench gained industry-standard status following the launch of Cognition's Devin AI agent, with major labs adopting it to assess AI coding capabilities. Yang highlighted the expansion of SWE-bench to multiple languages and the development of new benchmarks such as CodeClash, which focuses on long-horizon development and agent tournaments.

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: Discussion of a widely adopted benchmark for evaluating AI coding agents and its variants.


Coverage (1 source)

  1. Latent Space Podcast (Tier 1) · Latent.Space

    [State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

    From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark bec…