A new paper evaluates five leading vision-language models (VLMs) on their trustworthiness for medical visual question answering (VQA). The study found that the models struggle to localize anatomical targets accurately and tend to confuse left and right (laterality confusion); the best model reached a mean IoU of only 0.23. Integrating localization into a pipeline degraded performance further, pointing to grounding as a key bottleneck. Domain adaptation shows promise for improving VQA accuracy, but the perception and trustworthiness issues remain.
AI IMPACT Identifies critical perception and grounding failures in frontier VLMs for medical applications, suggesting domain adaptation is needed to improve trustworthiness.
RANK_REASON Academic paper evaluating frontier models on a specific task.
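For readers unfamiliar with the metric, the mean IoU figure in the summary can be read as average overlap between predicted and reference regions. The sketch below assumes axis-aligned bounding boxes in `(x_min, y_min, x_max, y_max)` form; the summary does not say whether the paper scores boxes or masks, and the names `box_iou` and `mean_iou` are hypothetical, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 instead of dividing by zero.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and reference boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Illustrative, made-up boxes in a 100-pixel-wide image: the prediction has
# the right size but sits on the opposite side of the midline.
target = (10.0, 10.0, 30.0, 30.0)
mirrored_prediction = (70.0, 10.0, 90.0, 30.0)
laterality_iou = box_iou(mirrored_prediction, target)
```

In this made-up case the mirrored prediction does not overlap the target at all, so `laterality_iou` is zero. That is one way a laterality error can pull a mean IoU down even when the predicted region has a plausible size and shape.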
AI-generated summary · Google Gemini · from 3 sources.