PulseAugur

New MedVIGIL benchmark tests medical AI's trustworthiness

Researchers have introduced MedVIGIL, a benchmark for evaluating the trustworthiness of medical vision-language models (VLMs). It tests whether a model can recognize when visual evidence is insufficient or misleading, a critical requirement for clinical use. MedVIGIL comprises 300 cases with expert-authored questions, answers, and risk assessments, and was used to test 16 VLMs and 2 text-only models. The strongest audited model, Claude Opus 4.7, scored 69.2 on the MedVIGIL Composite Score, 14.1 points below the independent radiologist's score of 83.3.

Impact: This benchmark will help developers build more reliable medical AI systems by measuring how well models handle broken or misleading visual evidence.

Ranking rationale: The cluster describes the release of an academic paper introducing a new benchmark for evaluating AI models.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.CV · TIER_1 · English (EN) · Xiang Li

    MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

    Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbe…