PulseAugur

New DeskCraft benchmark tests AI agents on complex professional tasks

Researchers have introduced DeskCraft, a benchmark that evaluates desktop agents on complex, long-horizon professional tasks and on human-in-the-loop collaboration. Its tasks span creative and engineering software, require more than 50 execution steps, and come with formalized interaction protocols for mid-turn and post-turn exchanges. In initial evaluations, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks, which points to persistent difficulty with long-horizon workflow execution and proactive clarification.

IMPACT This benchmark could help drive development of more capable desktop AI agents for complex, real-world professional tasks.

RANK_REASON The cluster contains a research paper introducing a new benchmark for AI agents.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang

    DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

    arXiv:2606.03103v1. Abstract: Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users pro…