Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
Researchers have introduced LivingScreen, a new benchmark designed to evaluate GUI agents on dynamic short-video platforms. Whereas existing agents typically assume a static screen, agents evaluated on LivingScreen must operate in environments where content plays continuously, so they have to decide when to observe the screen and for how long. In evaluations of current frontier models, none matched human performance in accuracy and cost-efficiency. Common failures included observing too much or too little, which points to a need for better observation control in future GUI agents.
AI IMPACT This benchmark highlights a critical gap in current GUI agents' ability to handle dynamic environments, and could guide future research toward more adaptive and efficient AI systems.