Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?
A new research paper investigates whether vision-language models truly understand 3D spatial relationships or merely catalogue objects. The researchers developed a benchmark of more than 3,000 samples that tests three abilities: depth-ordered occlusion, optical-geometry inference, and volumetric rearrangement planning. The study found that models excel at planning rearrangements but perform poorly on occlusion and reflection-based spatial reasoning, indicating a dissociation between their planning ability and their spatial reasoning.
IMPACT: Highlights limitations in current vision-language models' understanding of 3D space, suggesting areas for future research and development.