Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
A new benchmark called Workflow-GYM has been introduced to evaluate AI agents on complex, long-horizon tasks within professional software environments. Current AI agents demonstrate significant limitations in handling these real-world workflows, with even the most advanced models achieving success rates just above 30%. The research highlights issues such as inconsistent workflow execution, error propagation, and a lack of understanding of specialized professional software, indicating a need for substantial advancements in agent capabilities.
AI IMPACT: Highlights significant limitations in current AI agents for professional tasks, guiding future research in agentic AI.