Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
A new research paper introduces the SpatialUncertain framework, which evaluates whether vision-language models (VLMs) recognize when a spatial question cannot be answered because of occlusion or a misleading perspective. The study found that current frontier VLMs are prone to overconfidence, answering incorrectly about 70% of the time under occlusion and over 90% of the time under perspective ambiguity. Many models also struggle to identify which additional viewpoints would resolve such ambiguities, which highlights the need to assess VLM uncertainty and evidence-seeking capabilities, not just answer correctness.
IMPACT: Highlights critical limitations in VLM spatial reasoning and uncertainty awareness, and makes the case for evaluation methods that go beyond simple accuracy.