PulseAugur
Original headline (Chinese, translated): "Agents' Last Exam": Fable 5 Surprisingly Loses to GPT 5.5

New Benchmark Shows GPT 5.5 Outperforming Claude Fable 5 on Real-World Tasks

A new benchmark called Agents' Last Exam (ALE), developed by researchers from UC Berkeley and other institutions, tests AI agents on complex, real-world tasks. On the hardest tier of tasks, leading models including Anthropic's Claude Fable 5 and OpenAI's GPT 5.5 scored zero, which points to significant limitations in handling such work. On slightly less difficult tasks, GPT 5.5 outperformed Claude Fable 5, a reversal of previous benchmark results.

IMPACT This benchmark highlights the gap between theoretical performance and practical application for AI agents, suggesting current models struggle with complex, real-world tasks despite strong performance on traditional benchmarks.

RANK_REASON The cluster describes a new benchmark for AI agents, detailing its methodology and initial results, which is a research-oriented contribution to the field.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. 量子位 (QbitAI) · TIER_1 · Chinese (ZH) · 一水

    "Agents' Last Exam": Fable 5 surprisingly loses to GPT 5.5

    Zero across the board on the hardest tier