Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?
A new study published on arXiv explores using Large Language Models (LLMs) as a cost-effective alternative to human expert panels for evaluating medical AI systems. The research introduces an "LLM Jury" of three frontier models that scores diagnoses and clinical reasoning across real-world hospital cases. The findings indicate that uncalibrated LLM scores are lower than expert scores but maintain ordinal agreement with them and show a lower probability of severe-risk errors. Crucially, a calibrated LLM Jury, combined with LLM-generated diagnoses, can effectively identify high-risk errors, enabling targeted expert review and improving panel efficiency without exhibiting self-preference bias.
IMPACT: Calibrated LLM Juries could significantly reduce the cost and time required to evaluate medical AI systems, accelerating the development and deployment of those systems.