PulseAugur

Anthropic's Claude 4.8 performance declines on hard prompt benchmark

Anthropic's Claude Opus 4.8 Thinking has slipped on LMArena's "Hard Prompts English" leaderboard, according to a user post on Reddit. Opus 4.6 Thinking holds the top spot, Opus 4.7 Thinking trails it by 15 points, and Opus 4.8 Thinking trails it by 23 points. Some users regard this benchmark as resistant to "benchmaxxing" and as a better reflection of real-world performance.

IMPACT Performance regressions in leading models, even on specific benchmarks, highlight the challenges in maintaining consistent AI capabilities as models evolve.

RANK_REASON User commentary on a benchmark leaderboard showing performance degradation of a specific model version.

Read on r/singularity →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/singularity TIER_2 English(EN) · /u/LegitimateLength1916 ·

    Opus 4.8 Thinking keeps deteriorating on Hard Prompts English in LMArena (again)

    Opus 4.6 Thinking keeps the #1 spot.

    Followed by Opus 4.7 Thinking (-15 points).

    Lastly, Opus 4.8 Thinking (-23 points compared to 4.6 Thinking).

    https://arena.ai/leaderboard/text/hard-prompts-english …
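To put the reported gaps in perspective, here is a minimal Python sketch that converts a score difference into an expected head-to-head win rate. It assumes the leaderboard scores sit on an Elo-style logistic scale (base 10, 400-point scale), which the post does not state. The function name `expected_win_rate` is hypothetical, and the 4.7-versus-4.8 gap is derived on the assumption that both reported deltas are measured against Opus 4.6 Thinking.

```python
def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model.

    Assumption: scores follow an Elo-style logistic curve (base 10, 400-point scale).
    """
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


# Gaps reported in the post, read as points below Opus 4.6 Thinking.
gaps_vs_46 = {"Opus 4.7 Thinking": 15, "Opus 4.8 Thinking": 23}

for model, gap in gaps_vs_46.items():
    print(f"Opus 4.6 Thinking vs {model}: {expected_win_rate(gap):.1%}")

# Derived gap between 4.7 and 4.8, assuming both deltas share the same baseline.
gap_47_vs_48 = gaps_vs_46["Opus 4.8 Thinking"] - gaps_vs_46["Opus 4.7 Thinking"]
print(f"Opus 4.7 Thinking vs Opus 4.8 Thinking: {expected_win_rate(gap_47_vs_48):.1%}")
```

Under that assumption, gaps of this size correspond to win rates only a few percentage points above 50%, so the ranking change may reflect a modest difference in head-to-head preference rather than a large one.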