Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [2 sources]

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Researchers have developed a benchmark and a data synthesis framework to improve how GUI agents recover from errors. The benchmark, GUI-RobustEval, includes over 1,200 test cases that systematically measure how well agents recover from their own mistakes. The framework, RoTS (Robustness-driven Trajectory Synthesis), generates 800,000 data points for training agents on diverse error modes and their corresponding recovery steps. Models fine-tuned on this data, such as RoTS-32B, show significant performance gains and achieve state-of-the-art results on benchmarks like OSWorld.

IMPACT Improves the reliability of AI agents by strengthening their ability to recover from self-induced errors, which could accelerate real-world deployment.

Hugging Face
arXiv
OSWorld
GUI agents
RoTS
Robustness-driven Trajectory Synthesis
RoTS-32B
GUI-RobustEval
RoTS-7B