PulseAugur
Live 09:22:25

New benchmark tests GUI agents on dynamic short-video platforms

Researchers have introduced LivingScreen, a benchmark for evaluating GUI agents on dynamic short-video platforms. Existing GUI agents assume a static screen, but on LivingScreen the content keeps playing, so an agent must decide when to observe and for how long. Evaluations of current frontier models found that none matched human performance in both accuracy and cost-efficiency. Common failures included observing too much or too little, which points to a need for better observation control in future GUI agents.
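The observation-control trade-off can be made concrete with a toy simulation. This is not the LivingScreen benchmark or any method from the paper. The `Episode` type, the two policies, and the cost of one unit per second watched are hypothetical assumptions chosen only for illustration. The sketch contrasts watching for a fixed time, which either stops before the needed content appears (insufficient observation) or keeps watching long after it (excessive observation), with re-checking the screen and acting as soon as the content is visible.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical toy episode: the information the agent needs is on
    screen only between `info_start` and `info_end` seconds of playback."""
    info_start: float
    info_end: float


def fixed_observation(ep: Episode, duration: float) -> tuple[bool, float]:
    """Watch for a fixed `duration`, then act.

    Succeeds only if the needed information appeared within the window.
    Cost is the time spent watching (1 unit per second, an assumption).
    """
    saw_info = duration >= ep.info_start
    return saw_info, duration


def adaptive_observation(ep: Episode, step: float = 0.5,
                         max_wait: float = 30.0) -> tuple[bool, float]:
    """Re-check the screen every `step` seconds and stop as soon as the
    needed information is visible, or give up after `max_wait` seconds."""
    t = 0.0
    while t <= max_wait:
        if ep.info_start <= t <= ep.info_end:
            return True, t
        t += step
    return False, max_wait


if __name__ == "__main__":
    ep = Episode(info_start=4.0, info_end=6.0)
    print("too short:", fixed_observation(ep, 2.0))    # insufficient observation
    print("too long: ", fixed_observation(ep, 20.0))   # excessive observation
    print("adaptive: ", adaptive_observation(ep))
```

In this toy setting the short fixed window fails, the long one succeeds at five times the cost, and the polling policy succeeds at the earliest moment it checks after the content appears. A real agent would also pay for each check, so the polling interval is itself a trade-off.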

Impact: This benchmark exposes a critical gap in current GUI agents' ability to handle dynamic environments and may steer future research toward more adaptive and efficient AI systems.

Ranking rationale: The cluster contains an academic paper introducing a new benchmark for evaluating AI agents.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.CL TIER_1 English (EN) · Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

    Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

    arXiv:2606.04701v1 Announce Type: cross Abstract: GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must d…