DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
Researchers have introduced DDX-TRACE, a new benchmark designed to evaluate the diagnostic reasoning capabilities of vision-language models (VLMs) in medical contexts. Unlike existing benchmarks that focus solely on final answers, DDX-TRACE assesses the entire diagnostic trajectory, including how models request evidence, update differential diagnoses, and manage uncertainty over sequential steps. Initial evaluations of state-of-the-art VLMs revealed significant shortcomings: models can achieve high scores on final diagnoses without demonstrating sound clinical reasoning or efficient evidence gathering.
AI IMPACT: This benchmark aims to improve the evaluation of AI models in medical diagnosis by focusing on the reasoning process rather than just the final answer.