Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H

By PulseAugur Editorial · Summary from 17 sources

Researchers have introduced A11y-Compressor, a framework that makes GUI agent observations more efficient by transforming linearized accessibility trees into structured representations. The method reduces input tokens significantly while improving task success rates. Separately, a new benchmark, WindowsWorld, evaluates GUI agents on complex, multi-application professional workflows and reveals that current agents perform poorly in such scenarios. A third work, VLAA-GUI, offers a modular framework that addresses two failure modes of autonomous GUI agents: early stopping, where an agent declares success prematurely, and repetitive loops. It incorporates components for verification, loop breaking, and online search.

IMPACT New benchmarks and frameworks are emerging to push the capabilities of GUI agents in complex, real-world scenarios.

RANK_REASON Multiple arXiv papers introducing new frameworks, benchmarks, and methods for GUI agents.

COVERAGE [17]

Hugging Face Blog TIER_1 · 2025-06-03 13:27

Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H
arXiv cs.CL TIER_1 · Michito Takeshita, Takuro Kawada, Takumi Ohashi, Shunsuke Kitada, Hitoshi Iyatomi · 2026-05-04 04:00

A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction

arXiv:2605.00551v1 · AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but …
arXiv cs.CL TIER_1 · Hitoshi Iyatomi · 2026-05-01 10:16

A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction

AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but it suffers from redundancy and lacks structural …
arXiv cs.AI TIER_1 · Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang · 2026-05-01 04:00

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

arXiv:2604.27776v1 · While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordina…
arXiv cs.CL TIER_1 · Min Zhang · 2026-04-30 12:13

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish …
Hugging Face Daily Papers TIER_1 · 2026-04-28 08:43

Benchmarking and Improving GUI Agents in High-Dynamic Environments

Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents …
arXiv cs.CL TIER_1 · Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie · 2026-04-27 04:00

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

arXiv:2604.21375v2 · Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recov…
arXiv cs.CL TIER_1 · Cihang Xie · 2026-04-23 07:42

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic fram…
Hugging Face Daily Papers TIER_1 · 2026-04-23 07:42

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic fram…
arXiv cs.CV TIER_1 · Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma · 2026-05-04 04:00

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

arXiv:2605.00642v1 · Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO)…
arXiv cs.CV TIER_1 · Can Ma · 2026-05-01 13:23

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely o…
arXiv cs.CV TIER_1 · Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen · 2026-05-01 04:00

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

arXiv:2604.27974v1 · Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreli…
arXiv cs.CV TIER_1 · Xiuying Chen · 2026-04-30 15:03

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and …
arXiv cs.CV TIER_1 · Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li · 2026-04-29 04:00

Benchmarking and Improving GUI Agents in High-Dynamic Environments

arXiv:2604.25380v1 · Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments…
arXiv cs.CV TIER_1 · Qing Li · 2026-04-28 08:43

Benchmarking and Improving GUI Agents in High-Dynamic Environments

Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents …
arXiv cs.CV TIER_1 · Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang · 2026-04-28 04:00

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

arXiv:2604.24441v1 · Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a …
arXiv cs.CV TIER_1 · Zhaoxiang Zhang · 2026-04-27 13:06

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics an…
