Brief · PulseAugur

TOOL · arXiv cs.CV · English (EN) · 1d · [2 sources]

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Researchers have introduced MedVIGIL, an evaluation suite for testing the trustworthiness of medical vision-language models (VLMs). It measures how well these models recognize when visual evidence is compromised or misleading, a critical requirement for clinical use. MedVIGIL comprises 300 cases, curated and annotated by board-certified radiologists, that assess model performance under various forms of broken visual evidence. The benchmark reveals a significant gap between human and model performance: the strongest audited model, Claude Opus 4.7, scored considerably lower than the independent radiologist baseline.

IMPACT: Establishes a new benchmark for evaluating the trustworthiness of medical AI, highlighting current model limitations in recognizing compromised visual evidence.

Claude Opus 4.7
MedVIGIL
Junhao Chen