A new benchmark called Agents' Last Exam (ALE) has been introduced to evaluate AI agents on long-horizon, economically valuable tasks in real-world professional domains. Developed with over 250 industry experts, ALE covers non-physical industries and includes over 1,000 tasks across 13 industry clusters. Current results indicate that even advanced AI agents struggle with these complex tasks, achieving an average full pass rate of only 2.6%. The benchmark is designed to be a living instrument, continuously expanding its task pool to bridge the gap between AI performance on benchmarks and its actual economic impact.
AI IMPACT Aims to better measure AI's real-world economic value and guide development towards practical applications.
RANK_REASON This is a research paper introducing a new benchmark for evaluating AI agents.
AI-generated summary · Google Gemini · from 1 source.