Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 10h

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Researchers have developed MedicalAgentsBench, a benchmark for evaluating complex medical reasoning in large language models. It comprises 862 clinical questions and compares internalized reasoning models against externalized agent-based frameworks. The findings indicate that each approach improves performance on its own and that combining them yields the best results: the o3-mini model paired with the MDAgents framework achieved the highest accuracy.

IMPACT This benchmark could drive improvements in how AI systems handle complex medical reasoning, which may in turn aid clinical decision support.

arXiv
OpenAI o1-mini
DeepSeek-R1
MedicalAgentsBench
MDAgents
Xiangru Tang