Brief · PulseAugur

TOOL · 量子位 (QbitAI) 中文(ZH) · 6h

On "Agents' Last Exam," Claude Fable 5 is unexpectedly beaten by GPT 5.5

A new benchmark called Agents' Last Exam (ALE), developed by researchers from UC Berkeley and other institutions, has produced unexpected results for AI agents. On its most challenging tasks, leading models such as Anthropic's Claude Fable 5 and OpenAI's GPT 5.5 scored zero, pointing to significant limitations in handling complex, real-world tasks. On slightly less difficult tasks, GPT 5.5 outperformed Claude Fable 5, reversing previous benchmark results.

IMPACT This benchmark highlights the gap between theoretical performance and practical application for AI agents, suggesting current models struggle with complex, real-world tasks despite strong performance on traditional benchmarks.

Anthropic
OpenAI
GPT 5.5
Claude Opus 4.7
Codex
Claude Code
UC Berkeley
Unreal Engine
Cursor CLI
Claude Opus 4.8
Agents' Last Exam
Claude Fable 5
Adobe After Effects
ALE Claw