New benchmarks test VLM spatial reasoning, robustness, and consistency
By PulseAugur Editorial · [14 sources]
Researchers have introduced several new benchmarks for evaluating spatial reasoning in vision-language models (VLMs). ArchSIBench focuses on architectural space understanding, while Flat-Pack Bench assesses spatio-temporal reasoning in tasks like furniture assembly. SpaceDG tests robustness by evaluating models under visual degradations such as motion blur and low light, and reports significant performance gaps for current models. Separately, a framework called SAGE aims to improve spatial reasoning by enforcing geometric logic consistency.
AI
IMPACT
These benchmarks and methods target specific weaknesses in how VLMs handle complex spatial relationships and degraded real-world visual inputs.
RANK_REASON
Multiple research papers introduce new benchmarks and methods for evaluating and improving spatial reasoning in vision-language models.
arXiv:2505.23764v3 Announce Type: replace-cross Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-imag…
arXiv cs.AI
TIER_1 · English (EN) · Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka
arXiv:2507.07644v4 Announce Type: replace Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, be…
arXiv cs.CL
TIER_1 · English (EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang
arXiv:2505.17015v2 Announce Type: replace-cross Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…
arXiv cs.AI
TIER_1 · English (EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang
arXiv:2605.20837v1 Announce Type: cross Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…
arXiv:2605.21625v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…
arXiv:2605.22536v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…
SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…
Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…
arXiv cs.CV
TIER_1 · English (EN) · Zhenghao Chen, Huiqun Wang, Di Huang
arXiv:2604.03318v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by i…
arXiv:2605.25334v1 Announce Type: new Abstract: Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large langua…
arXiv:2605.25524v1 Announce Type: new Abstract: Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraint…
Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …
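The gap between instance-level correctness and consistency under paired transformations can be illustrated with a small scoring sketch. The example below is not taken from any of the papers listed here: the horizontal-flip answer mapping, the `PairedResult` record, and both metric functions are hypothetical names chosen for illustration. It assumes each test item is asked twice, once on the original image and once on a mirrored copy, where the correct answer to the mirrored copy is predictable (left and right swap, everything else stays the same).

```python
from dataclasses import dataclass

# Assumed answer mapping for a horizontal flip (illustrative only):
# "left" and "right" swap; any other answer is unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


@dataclass
class PairedResult:
    gold: str           # correct answer for the original input
    pred_original: str  # model answer on the original input
    pred_flipped: str   # model answer on the horizontally flipped input


def expected_after_flip(answer: str) -> str:
    """Return the answer that is correct after the flip is applied."""
    return FLIP_MAP.get(answer, answer)


def instance_accuracy(results: list[PairedResult]) -> float:
    """Fraction of items answered correctly on the original input only."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.pred_original == r.gold for r in results) / len(results)


def paired_consistency(results: list[PairedResult]) -> float:
    """Fraction of items answered correctly on BOTH the original and flipped input."""
    if not results:
        raise ValueError("results must not be empty")
    both_correct = sum(
        r.pred_original == r.gold
        and r.pred_flipped == expected_after_flip(r.gold)
        for r in results
    )
    return both_correct / len(results)


# Toy data, invented for this sketch.
results = [
    PairedResult("left", "left", "right"),    # correct on both views
    PairedResult("right", "right", "right"),  # correct on original, wrong after flip
    PairedResult("above", "above", "above"),  # correct on both; flip leaves answer unchanged
    PairedResult("left", "right", "left"),    # wrong on both views
]

print(instance_accuracy(results))   # 0.75
print(paired_consistency(results))  # 0.5
```

On this toy data the model looks 75% accurate when scored per instance but only 50% accurate when both members of each pair must be right, which is the kind of gap the abstract describes.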