PulseAugur

New benchmark reveals AI agents pass only 2.6% of its hardest real-world tasks

A new benchmark, Agents' Last Exam (ALE), evaluates AI agents on complex, real-world tasks drawn from professional industries. Developed with more than 250 industry experts, ALE comprises over 1,000 tasks across 13 industry clusters, drawing on actual expert projects and using the U.S. federal occupational taxonomy. Initial results show that current AI agents pass only 2.6% of tasks in the most challenging tier, pointing to a wide gap between AI capabilities and practical workplace automation.

IMPACT Highlights the gap between AI agent performance on benchmarks and real-world economic value, suggesting a longer timeline for widespread AI workplace automation.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for AI agents.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Hugging Face Daily Papers TIER_1 (CA)

    Agents' Last Exam

    Agents' Last Exam (ALE) is a benchmark for evaluating AI agents on long-term, economically valuable real-world tasks across 13 industry clusters with 1K+ tasks, revealing significant gaps between benchmark performance and practical deployment.