A new benchmark, AdversarialAnatomyBench, has been introduced to evaluate vision-language models (VLMs) on rare anatomical variants in medical imaging. Testing 25 state-of-the-art VLMs revealed a significant drop in accuracy from 71% on typical anatomy to 28% on atypical presentations. Even top models like GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick experienced performance declines of 41-51%, indicating a critical limitation in their generalization to rare medical cases. The research suggests that neither model scaling nor bias-aware prompting effectively resolves these issues, highlighting the need for improved multimodal AI systems in healthcare.
AI IMPACT Highlights critical generalization failures in VLMs for medical applications, necessitating further research into robustness for rare cases.
RANK_REASON Academic paper introducing a new benchmark and evaluation of existing models.
- AdversarialAnatomyBench
- arXiv
- Gemini 2.5 Pro
- GPT-5
- Hugging Face
- Leon Mayer
- Llama 4 Maverick
- vision-language model
AI-generated summary · Google Gemini · from 1 source.