MedCTA: A Benchmark for Clinical Tool Agents
Researchers have introduced MedCTA, a benchmark for evaluating AI agents in clinical settings. It targets tasks that require planning, tool retrieval, and evidence acquisition, going beyond simple recognition or single-turn question answering. MedCTA comprises 107 real-world clinical tasks with clinician-verified trajectories across five deployed tools, and it assesses aspects such as tool selection, execution stability, and outcome quality. Initial benchmarking of 18 models found that even advanced systems struggle with multi-step clinical tool use, showing protocol failures and incorrect tool recruitment.
IMPACT: Highlights the limits of current clinical AI agents in using tools reliably, indicating a need for improved agentic behavior in healthcare.