MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
Researchers have developed MacArena, a benchmark for evaluating computer-use agents (CUAs) in a macOS environment. It comprises 421 tasks across 50 applications, is built for Apple Silicon, and uses Apple's native Virtualization framework. MacArena aims to address the limitations of existing benchmarks, which often focus on Linux-based systems and may not reflect the challenges specific to macOS GUIs. Initial evaluations indicate that performance on MacArena can differ significantly from performance on other benchmarks, with some leading models showing a substantial drop on macOS-native tasks.
AI IMPACT: This benchmark could drive the development of more versatile AI agents capable of navigating diverse operating system environments.