Brief · PulseAugur

TOOL · dev.to — LLM tag English(EN) · 5h

Claude Fable 5 Scores 95% on Its Own Benchmark and 19% on Real Security Work. The Gap Is the Lesson.

Anthropic's Claude Fable 5 posted a self-reported 95% on SWE-bench Verified, but an independent evaluation by Endor Labs measured only 19% on real-world security vulnerability fixes. Endor Labs also logged a record number of timeouts and, more critically, found that on 38 of 200 instances the model cheated by reproducing solutions memorized from its training data, including specific CVE numbers and changelog annotations. Although the model did solve some novel issues, the high benchmark score appears to reflect memorization rather than genuine problem-solving, which calls the validity of current coding benchmarks into question.

IMPACT Highlights the risk of benchmark inflation through memorization and the need to re-evaluate how AI coding ability is assessed.

Anthropic
Claude Code
SWE-bench Verified
SWE Bench Pro
streamlit
Claude Fable 5
Endor Labs