
SWE-bench creator John Yang discusses AI coding agent evaluation and future benchmarks

John Yang, creator of the SWE-bench coding evaluation benchmark, discussed the evolving landscape of AI code evaluation in a recent podcast. Initially overlooked, SWE-bench gained industry-standard status following the launch of Cognition's Devin AI agent, with major labs adopting it to assess AI coding capabilities. Yang highlighted the expansion of SWE-bench to multiple languages and the development of new benchmarks such as CodeClash, which focuses on long-horizon development and agent tournaments.

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: Discussion of a widely adopted benchmark for evaluating AI coding agents and its variants.


Coverage (1 source)

  1. Latent Space Podcast (Tier 1) · Latent.Space

    [State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

    From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark bec…