OpenAI abandons SWE-bench Verified due to flawed tests and data contamination

By PulseAugur Editorial Team · [5 sources] · 2024-08-13 10:00

OpenAI has announced that it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. According to the company, the benchmark has become contaminated: score gains now come primarily from models' exposure to its problems and solutions during training rather than from genuine advances in software engineering skill. OpenAI found that a significant portion of the benchmark's tests incorrectly reject valid solutions, and that many models can reproduce ground-truth solutions verbatim, which indicates overlap with their training data. The company now recommends SWE-bench Pro for evaluations and is developing new, uncontaminated benchmarks.

Ranking rationale: OpenAI announced that it is discontinuing its use of a specific benchmark (SWE-bench Verified) and recommending an alternative (SWE-bench Pro), citing contamination and flawed tests.

Read at Latent Space Podcast →

AI-generated summary · Google Gemini · from 5 sources.


Sources [5]

OpenAI News TIER_1 English(EN) · 2026-02-23 11:00

Why we no longer evaluate SWE-bench Verified

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
OpenAI News TIER_1 English(EN) · 2024-08-13 10:00

Introducing SWE-bench Verified

We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.
Latent Space Podcast TIER_1 English(EN) · Latent.Space · 2026-02-23 20:03

⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) …
Mastodon — mastodon.social TIER_1 English(EN) · [email protected] · 2026-04-27 00:37

SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified no longer measures frontier coding capabilities Article URL: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ Comments URL: https://news.ycombinator.com/item?id=47910388 Points: 233 # Comments: 137 https://openai.com/index/why-we-no-l…
Mastodon — mastodon.social TIER_1 English(EN) · [email protected] · 2026-04-26 13:58

Why SWE-bench Verified no longer measures frontier coding capabilities

Why SWE-bench Verified no longer measures frontier coding capabilities https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ #HackerNews #Tech #AI
