PulseAugur
research · [5 sources]

OpenAI abandons SWE-bench Verified due to flawed tests and data contamination

OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated: models show improved scores primarily because they were exposed to its problems and solutions during training, not because of genuine advances in software engineering ability. OpenAI found that a significant portion of the benchmark's tests incorrectly reject valid solutions, and that many models can reproduce ground-truth solutions verbatim, indicating training-data overlap. The company now recommends SWE-bench Pro for evaluations and is developing new, uncontaminated benchmarks.

Summary written by gemini-2.5-flash-lite from 5 sources.


Read on Latent Space Podcast →


COVERAGE [5]

  1. OpenAI News TIER_1

    Why we no longer evaluate SWE-bench Verified

    SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.

  2. OpenAI News TIER_1

    Introducing SWE-bench Verified

    We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.

  3. Latent Space Podcast TIER_1 · Latent.Space

    ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

    Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)…

  4. Mastodon — mastodon.social TIER_1 · [email protected]

    SWE-bench Verified no longer measures frontier coding capabilities

    SWE-bench Verified no longer measures frontier coding capabilities Article URL: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ Comments URL: https://news.ycombinator.com/item?id=47910388 Points: 233 # Comments: 137

  5. Mastodon — mastodon.social TIER_1 · [email protected]

    Why SWE-bench Verified no longer measures frontier coding capabilities

    Why SWE-bench Verified no longer measures frontier coding capabilities https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ #HackerNews #Tech #AI