Vision Language Models Cannot Reason About Physical Transformation
A new research paper published on arXiv highlights significant limitations in how current Vision Language Models (VLMs) understand physical transformations. The study introduces ConservationBench, a dataset designed to test whether VLMs grasp the principle of conservation: that physical quantities remain invariant under transformations. Across 112 VLMs and more than 23,000 questions, the models performed at near-chance levels, indicating a fundamental failure to maintain consistent representations of physical properties.
AI IMPACT: Current VLMs struggle with fundamental physical reasoning, suggesting a need for new architectures or training methods to achieve robust embodied AI capabilities.