A user-created benchmark, ObviousBench, has revealed a performance regression in Anthropic's Opus 4.7 model compared to its predecessor, Opus 4.6. On the benchmark, which is designed to test models on simple reasoning errors, Opus 4.7 scored lower than Opus 4.6 even when run at a significantly higher configuration setting. The creator suggests that Opus 4.7's overconfidence and reduced reasoning token usage may be contributing to this apparent step backward in performance.
IMPACT Suggests potential issues with model versioning and performance consistency, prompting further investigation into Anthropic's model development.
RANK_REASON User-created benchmark reveals performance regression in a specific model version.
AI-generated summary · Google Gemini · from 1 source.