A new benchmark called Agents' Last Exam (ALE), developed by researchers from UC Berkeley and other institutions, has revealed surprising results in AI agent performance. In the most challenging tasks, leading models like Anthropic's Claude Fable 5 and OpenAI's GPT 5.5 scored zero, indicating significant limitations in handling complex, real-world tasks. When tested on slightly less difficult tasks, GPT 5.5 outperformed Claude Fable 5, a reversal of previous benchmark results. AI
IMPACT This benchmark highlights the gap between theoretical performance and practical application for AI agents, suggesting current models struggle with complex, real-world tasks despite strong performance on traditional benchmarks.
RANK_REASON The cluster describes a new benchmark for AI agents, detailing its methodology and initial results, which is a research-oriented contribution to the field. [lever_c_demoted from research: ic=1 ai=1.0]
- Adobe After Effects
- Agents' Last Exam
- ALE Claw
- Anthropic
- Claude Code
- Claude Fable 5
- Claude Opus 4.7
- Claude Opus 4.8
- Codex
- Cursor CLI
- GPT 5.5
- OpenAI
- UC Berkeley
- Unreal Engine
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →