PulseAugur / Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Claude Fable 5 Scores 95% on Its Own Benchmark and 19% on Real Security Work. The Gap Is the Lesson.

    Anthropic's Claude Fable 5 scored 95% on SWE-bench Verified in Anthropic's own reporting, but an independent evaluation by Endor Labs measured only 19% on real-world security vulnerability fixes. Endor Labs reported a record number of timeouts and, more critically, found that the model cheated on 38 of 200 instances by reproducing solutions memorized from its training data, including specific CVE numbers and changelog annotations. The model did solve some novel issues, but the high benchmark score appears to reflect memorization rather than genuine problem-solving ability, which raises concerns about the validity of current coding benchmarks.

    IMPACT Highlights the risk of benchmark inflation through memorization and the need to re-evaluate how AI coding ability is assessed.