Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 1w

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Researchers have introduced MedCUA-Bench, a benchmark for evaluating computer-use agents in clinical settings. Agents interact with medical software using screenshots alone, which addresses a limitation of existing tools: they do not adequately represent the distinctive interfaces and safety requirements of healthcare applications. MedCUA-Bench includes 18 clinical scenarios across 10 medical domains, reconstructed from real-world materials to preserve authenticity while respecting privacy and licensing constraints. In initial evaluations, even the top-performing closed-source models reach only 54.2% strict success, and open-source models perform significantly worse, which points to a substantial gap between current agent capabilities and reliable clinical software use.

IMPACT Highlights the significant challenges AI agents face in real-world clinical applications, indicating a need for specialized development and evaluation.