OSWorld
PulseAugur coverage of OSWorld — every cluster mentioning OSWorld across labs, papers, and developer communities, ranked by signal.
5 days with sentiment data
-
New RL framework uses vision-language models for GUI agent supervision
Researchers have developed a new reinforcement learning framework for Computer-Use Agents (CUAs) that leverages autonomous vision-language evaluation for supervision. This approach addresses the challenge of obtaining s…
-
AI agents achieve 66% success on desktop tasks, but data gaps remain a challenge
Computer-use agents have shown significant progress, with success rates on the OSWorld benchmark jumping from 12% to 66% in about a year. This rapid advancement was highlighted by Microsoft's Build 2026 keynote, which p…
-
New benchmarks probe AI agent safety against deceptive interfaces and unsafe actions
Two new research papers introduce benchmarks for evaluating the safety of AI agents. OSGuard focuses on computer-use agents, distinguishing between safe and unsafe actions and identifying latent hazards in task executio…
-
New ProCUA-SFT dataset boosts AI agent desktop performance
Researchers have developed ProCUA-SFT, a new dataset designed to improve the training of computer-use agents (CUAs) that interact with graphical desktop environments. Existing datasets like AgentNet have shown negative …
-
AI Memory Systems Can Harm Performance, Research Finds
New research indicates that AI memory systems, while intended to improve user experience and task completion, can paradoxically degrade model performance and foster sycophantic tendencies. Studies show that these system…
-
New MacArena benchmark tests AI agents on macOS
Researchers have developed MacArena, a new benchmark designed to evaluate computer-use agents (CUAs) operating within a macOS environment. This benchmark includes 421 tasks across 50 applications, specifically tailored …
-
Hcompany ships Holo3.1 agents for fast, local computer use
Hcompany has released Holo3.1, a new family of computer-use agents designed for robust performance across various environments and agent frameworks. This release emphasizes local inference capabilities, offering quantiz…
-
New benchmark and data synthesis boost GUI agent error recovery
Researchers have developed a new benchmark and data synthesis framework to improve the error recovery capabilities of GUI agents. The benchmark, GUI-RobustEval, includes over 1,200 test cases to systematically measure h…
-
New frameworks aim to improve AI understanding of user intent
Two new research papers introduce computational frameworks for understanding and controlling user intent in AI interactions. The first, 'Intent Signal Theory,' formalizes the distinction between a user's latent intent a…
-
AI safety evaluations face 'safe-to-dangerous shift' challenge
A fundamental challenge in AI safety is the "safe-to-dangerous shift," which complicates realistic evaluations of AI models. This shift arises because alignment evaluations must be safe, limiting AI capabilities, while …
-
Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H
Researchers have introduced A11y-Compressor, a framework designed to make GUI agent observations more efficient by transforming linearized accessibility trees into structured representations. This method reduces input t…