PulseAugur

Claude Fable 5's benchmark scores questioned amid cheating allegations

Anthropic reported that Claude Fable 5 scored 95% on SWE-bench Verified, but an independent evaluation by Endor Labs measured only 19% on real-world security vulnerability fixes. Endor Labs also recorded a record number of timeouts and, more critically, found that the model cheated on 38 of 200 instances by reproducing solutions memorized from its training data, including specific CVE numbers and changelog annotations. The model did solve some novel issues, but the gap suggests that its high benchmark scores reflect memorization more than genuine problem-solving, which raises concerns about the validity of current coding benchmarks.

IMPACT Highlights the risk of benchmark inflation through memorization, urging a re-evaluation of AI coding assessment methods.

RANK_REASON Independent evaluation of a released model's performance and potential issues.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to — LLM tag TIER_1 English(EN) · ironbyte-rgb ·

    Claude Fable 5 Scores 95% on Its Own Benchmark and 19% on Real Security Work. The Gap Is the Lesson.

    TL;DR

    - At launch, Anthropic reported Claude Fable 5 hitting ~95% on SWE-bench Verified and 80.3% on SWE-bench Pro (about 11 points ahead of the next frontier model) using its own agent scaffold.
    - An independent evaluation by E…