A new paper evaluates five leading vision-language models (VLMs) on their trustworthiness for medical visual question answering (VQA). The study found significant limitations in the models' ability to localize anatomical targets accurately, along with a tendency toward laterality confusion (mixing up left and right); the best model achieved a mean intersection over union (IoU) of only 0.23. Integrating localization into a pipeline further degraded performance, highlighting grounding as a key bottleneck. While domain adaptation shows promise for improving VQA accuracy, the perception and trustworthiness issues remain.
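For readers unfamiliar with the metric, the following is a minimal sketch of how a mean IoU score can be computed. It assumes axis-aligned bounding boxes in `(x_min, y_min, x_max, y_max)` form; the summary does not say whether the paper scores boxes, masks, or another region type, and the function names `box_iou` and `mean_iou` are illustrative, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Assumption: the box format is illustrative; the paper's own
    localization format is not stated in the summary.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def mean_iou(predictions, targets):
    """Average IoU over paired predicted and reference boxes."""
    scores = [box_iou(p, t) for p, t in zip(predictions, targets)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a score of 1.0 means the predicted region matches the reference exactly and 0.0 means no overlap at all, so a mean of 0.23 indicates that predicted regions overlap their targets only weakly on average.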
Impact: Identifies critical perception and grounding failures in frontier VLMs for medical applications, suggesting domain adaptation is needed to improve trustworthiness.
Ranking rationale: Academic paper evaluating frontier models on a specific task.
AI-generated summary · Google Gemini · based on 3 sources.