Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Researchers have developed new methods for training vision-language models (VLMs) in radiology. One approach introduces RefRad2D, a large dataset of 1.2 million CT and MR image-text pairs, used to train a model called RadGrounder that can generate reports, answer questions, and perform spatial grounding. Another study finds that some chest radiography VLMs may not need image input to reach high accuracy: text-only models perform comparably to multimodal ones on certain tasks. This highlights the need for grounding audits, which check that models are actually interpreting medical images rather than relying on text priors.
AI IMPACT: Highlights the potential for more reliable AI in medical imaging by questioning whether models actually use image data and by emphasizing grounding audits.
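As an illustration of what a grounding audit could look like in its simplest form, the sketch below compares a model's accuracy with and without image input and flags cases where the gap is small. It is not the procedure used in the studies summarized above: the function names, the exact-match scoring, and the `min_gap` threshold of 0.05 are all assumptions made for this example.

```python
from typing import Dict, Sequence, Union


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def grounding_audit(
    with_image: Sequence[str],
    text_only: Sequence[str],
    labels: Sequence[str],
    min_gap: float = 0.05,  # hypothetical threshold, chosen for illustration
) -> Dict[str, Union[float, bool]]:
    """Compare accuracy with and without image input on the same questions.

    A small gap suggests the answers can be recovered from text priors
    alone, so the benchmark (or the model) may not be testing image use.
    """
    acc_img = accuracy(with_image, labels)
    acc_txt = accuracy(text_only, labels)
    gap = acc_img - acc_txt
    return {
        "acc_with_image": acc_img,
        "acc_text_only": acc_txt,
        "gap": gap,
        "image_dependent": gap >= min_gap,
    }


# Toy data: the same four questions answered with and without the image.
labels = ["yes", "no", "yes", "no"]
with_image = ["yes", "no", "yes", "yes"]
text_only = ["yes", "no", "no", "yes"]
report = grounding_audit(with_image, text_only, labels)
```

In practice the two prediction lists would come from running the same model twice on each question, once with the image and once with the image withheld or replaced by a blank input, and the threshold would need to be chosen per task.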