DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Researchers have introduced DeskCraft, a benchmark for evaluating desktop agents on complex, long-horizon professional tasks and on human-in-the-loop collaboration. It includes tasks in creative and engineering software that require more than 50 execution steps, and it formalizes interaction protocols for mid-turn and post-turn exchanges. In initial evaluations, GPT-5.4 achieved 31.6% on standard tasks and 27.6% on interactive tasks, highlighting persistent challenges in long-horizon workflow execution and proactive clarification.
IMPACT: This benchmark could drive development of more capable desktop AI agents for complex, real-world professional tasks.