Researchers have introduced HCAST, a benchmark of 189 tasks spanning machine learning engineering, cybersecurity, and software development. The benchmark includes human performance data: the tasks take skilled individuals anywhere from one minute to more than eight hours. Evaluations show that current AI agents succeed on 70-80% of tasks that take humans under one hour, but on fewer than 20% of tasks that take humans more than four hours.
Summary written by gemini-2.5-flash-lite from 1 source.