Researchers have introduced DeskCraft, a benchmark for evaluating desktop agents on complex, long-horizon professional tasks and on human-in-the-loop collaboration. It covers tasks in creative and engineering software that require more than 50 execution steps, and it formalizes interaction protocols for mid-turn and post-turn exchanges. In initial evaluations, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks, which points to persistent difficulty with long-horizon workflow execution and proactive clarification.
AI IMPACT This benchmark could drive development of more capable desktop AI agents for complex, real-world professional tasks.
RANK_REASON The cluster contains a research paper introducing a new benchmark for AI agents.
AI-generated summary · Google Gemini · from 1 source.