PulseAugur / Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

    Researchers have introduced DeskCraft, a benchmark that evaluates desktop agents on complex, long-horizon professional tasks and on human-in-the-loop collaboration. The benchmark includes tasks in creative and engineering software that require over 50 execution steps, and it formalizes interaction protocols for mid-turn and post-turn exchanges. In initial evaluations, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks, which points to persistent challenges in long-horizon workflow execution and proactive clarification.

    IMPACT This benchmark could help guide development of more capable desktop AI agents for complex, real-world professional tasks.