PulseAugur

New benchmark CUActSpot targets complex interactions for AI agents

Researchers have introduced CUActSpot, a benchmark for evaluating computer-use agents (CUAs) on complex, infrequent interactions across multiple modalities. It targets a long-tail pattern in GUI operations, in which a small set of complex interactions accounts for most task failures; the authors hypothesize that data scarcity is the cause. Their data-synthesis pipeline generates scenes, records interactions, and uses an LLM to produce instructions and action traces. With the resulting data, their Phi-Ground-Any-4B model outperforms larger open-source models.

Impact: The benchmark aims to make AI agents more reliable on complex tasks, which could increase user trust and adoption in real-world applications.

Ranking rationale: The cluster contains an academic paper introducing a new benchmark and model.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.CV · TIER_1 · English (EN) · Baining Guo

    Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operat…