Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
Researchers have developed a new benchmark and a data synthesis framework to improve the error recovery capabilities of GUI agents. The benchmark, GUI-RobustEval, includes over 1,200 test cases that systematically measure how well agents recover from their own mistakes. The framework, RoTS, generates 800,000 data points for training agents on diverse error modes and their corresponding recovery steps. Models fine-tuned on this data, such as RoTS-32B, have shown significant performance gains and achieved state-of-the-art results on benchmarks such as OSWorld.
IMPACT: Enhances the reliability of AI agents by improving their ability to recover from self-induced errors, potentially accelerating real-world deployment.