Medical VLM benchmarks show pretraining contamination, study finds

By PulseAugur Editorial · [1 source] · 2026-06-10 04:00

Researchers audited public medical vision-language benchmarks for pretraining contamination and found measurable image-side overlap on the SLAKE-En benchmark with models such as SigLIP-B-16. On the text side, the analysis found canonical-order exchangeability signals for Qwen2.5-VL on SLAKE-En and for other VLMs on OmniMedVQA. However, the study concluded that some detection methods, such as cohort-relative tail enrichment, are unreliable for small medical VLM cohorts.
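The summary does not describe how the audit computes its exchangeability signal. As a rough illustration of the general idea only, the sketch below runs a generic permutation test: if a model scores a benchmark's published (canonical) example order higher than almost every shuffled order, that order sensitivity is one possible hint that the benchmark file was seen during pretraining. Everything here is an assumption for illustration: `score_sequence` stands in for a model log-likelihood, and both toy scorers are hypothetical, not the paper's method or any real model API.

```python
import random
from typing import Callable, Sequence


def exchangeability_p_value(
    examples: Sequence[str],
    score_sequence: Callable[[Sequence[str]], float],
    num_shuffles: int = 200,
    seed: int = 0,
) -> float:
    """One-sided permutation p-value for order sensitivity.

    Returns the (smoothed) fraction of shuffled orders whose score is at
    least as high as the score of the canonical order.
    """
    rng = random.Random(seed)
    canonical_score = score_sequence(examples)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if score_sequence(shuffled) >= canonical_score:
            at_least_as_high += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (at_least_as_high + 1) / (num_shuffles + 1)


# Hypothetical benchmark items in their published order.
CANONICAL = [f"question-{i}" for i in range(10)]
_ADJACENT = set(zip(CANONICAL, CANONICAL[1:]))


def toy_memorized_scorer(seq: Sequence[str]) -> float:
    """Stand-in for a contaminated model: rewards canonical neighbours."""
    return float(sum((a, b) in _ADJACENT for a, b in zip(seq, seq[1:])))


def toy_order_blind_scorer(seq: Sequence[str]) -> float:
    """Stand-in for a clean model: the score ignores ordering entirely."""
    return float(sum(len(item) for item in seq))


p_memorized = exchangeability_p_value(CANONICAL, toy_memorized_scorer)
p_order_blind = exchangeability_p_value(CANONICAL, toy_order_blind_scorer)
```

With the toy scorers, `p_memorized` comes out small because almost no shuffle preserves as many canonical neighbours as the original order, while `p_order_blind` is exactly 1.0 because every shuffle ties with the canonical score. A real audit would replace the toy scorers with an actual model likelihood and would need to account for the small-cohort reliability concerns the study raises.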

IMPACT Highlights potential flaws in current VLM evaluation methods, necessitating more robust auditing for reliable medical AI development.

RANK_REASON The cluster contains an academic paper detailing research findings on AI model evaluation.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Bruce Changlong Xu, Lan Wu, Alexander Ryu · 2026-06-10 04:00

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

arXiv:2606.10066v1 · Announce Type: cross

Abstract: Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We…
