Anthropic's Claude Opus 4.8 has scored above 1% on the ARC-AGI 3 benchmark, the first time any AI model has passed that threshold on this challenging evaluation. The ARC-AGI benchmark is designed to test an AI's ability to perform abstract reasoning tasks, which makes the result notable for the field.
IMPACT Sets a new reference point for abstract reasoning capabilities in LLMs, potentially influencing future model development.
RANK_REASON New model version release with benchmark performance.
AI-generated summary · Google Gemini · from 1 source.