New benchmark reveals critical weaknesses in VLMs for rare medical anatomy

By PulseAugur Editorial · [1 source] · 2026-06-26 04:00

A new benchmark, AdversarialAnatomyBench, evaluates vision-language models (VLMs) on rare anatomical variants in medical imaging. Across 25 state-of-the-art VLMs, accuracy fell from 71% on typical anatomy to 28% on atypical presentations. Even top models such as GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick showed performance declines of 41-51%, pointing to a critical limitation in how well they generalize to rare medical cases. The research suggests that neither model scaling nor bias-aware prompting resolves the problem, underscoring the need for more robust multimodal AI systems in healthcare.

IMPACT Highlights critical generalization failures in VLMs for medical applications, necessitating further research into robustness for rare cases.

RANK_REASON Academic paper introducing a new benchmark and evaluating existing models.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Leon Mayer, Piotr Kalinowski, Caroline Ebersbach, Marcel Knopp, Tim Rädsch, Evangelia Christodoulou, Annika Reinke, Fiona R. Kolbinger, Lena Maier-Hein · 2026-06-26 04:00

6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

arXiv:2512.04238v2. Abstract: Vision-language models (VLMs) are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare var…
