Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers · 4d · [3 sources]

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

A new benchmark, Workflow-GYM, evaluates AI agents on complex, long-horizon tasks in professional software environments. Current agents show significant limitations on these real-world workflows: even the most advanced models achieve success rates just above 30%. The research highlights inconsistent workflow execution, error propagation, and a poor understanding of specialized professional software, indicating that agent capabilities need substantial improvement.

IMPACT: Highlights significant limitations of current AI agents on professional tasks, guiding future research in agentic AI.

AI agents
Workflow-GYM