PulseAugur

New benchmarks test VLM spatial reasoning, robustness, and consistency

Several new benchmarks evaluate the spatial reasoning of vision-language models (VLMs). ArchSIBench targets architectural space understanding, while Flat-Pack Bench assesses spatio-temporal reasoning through furniture assembly. SpaceDG tests robustness under visual degradation, such as motion blur and low light, and reports significant performance gaps for current models. Separately, a framework called SAGE aims to improve spatial reasoning by enforcing geometric logic consistency.

IMPACT These benchmarks and methods target how well VLMs handle complex spatial relationships and degraded real-world visual input.

RANK_REASON Multiple research papers introduce new benchmarks and methods for evaluating and improving spatial reasoning in vision-language models.

AI-generated summary · Google Gemini · from 14 sources.

COVERAGE [14]

  1. arXiv cs.CL TIER_1 English(EN) · Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    arXiv:2505.23764v3 · Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-imag…

  2. arXiv cs.AI TIER_1 English(EN) · Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

    FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

    arXiv:2507.07644v4 · We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, be…

  3. arXiv cs.CL TIER_1 English(EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

    Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

    arXiv:2505.17015v2 · Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…

  4. arXiv cs.AI TIER_1 English(EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang

    ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

    arXiv:2605.20837v1 · Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…

  5. arXiv cs.CL TIER_1 English(EN) · Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

    Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

    arXiv:2605.21625v1 · The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…

  6. arXiv cs.CL TIER_1 English(EN) · Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    arXiv:2605.22536v1 · Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…

  7. arXiv cs.CL TIER_1 English(EN) · Zhihang Zhong

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…

  8. Hugging Face Daily Papers TIER_1 English(EN)

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.

  9. arXiv cs.AI TIER_1 English(EN) · Weixin Huang

    ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

    Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…

  10. Hugging Face Daily Papers TIER_1 English(EN)

    CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

    Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…

  11. arXiv cs.CV TIER_1 English(EN) · Zhenghao Chen, Huiqun Wang, Di Huang

    EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    arXiv:2604.03318v2 · Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by i…

  12. arXiv cs.CV TIER_1 English(EN) · Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

    Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

    arXiv:2605.25334v1 · Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large langua…

  13. arXiv cs.CV TIER_1 English(EN) · Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

    ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

    arXiv:2605.25524v1 · Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraint…

  14. arXiv cs.CV TIER_1 English(EN) · Ding Wang

    Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

    Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …
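
The consistency gap described in the last abstract above (a correct answer on the original input, a wrong one under a paired transformation whose answer mapping is predictable) can be illustrated with a toy check. This is a minimal sketch under stated assumptions, not the method of any paper listed here: scenes are stand-in dictionaries of x-coordinates rather than images, the only transformation is a horizontal flip (which swaps "left" and "right"), and `always_left_model`, `hflip`, and `evaluate` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: object name -> x-coordinate in [0, 1].
Scene = Dict[str, float]
# A "model" answers: is object a to the "left" or "right" of object b?
Model = Callable[[Scene, str, str], str]

# Predictable answer mapping for a horizontal flip.
FLIP_ANSWER = {"left": "right", "right": "left"}


def hflip(scene: Scene) -> Scene:
    """Mirror the toy scene horizontally."""
    return {name: 1.0 - x for name, x in scene.items()}


def ground_truth(scene: Scene, a: str, b: str) -> str:
    """Reference answer computed directly from the coordinates."""
    return "left" if scene[a] < scene[b] else "right"


def evaluate(model: Model, cases: List[Tuple[Scene, str, str]]) -> Dict[str, float]:
    """Report instance-level accuracy and flip consistency."""
    correct = consistent = 0
    for scene, a, b in cases:
        original = model(scene, a, b)
        flipped = model(hflip(scene), a, b)
        correct += original == ground_truth(scene, a, b)
        # Consistent if the flipped answer follows the known mapping.
        consistent += flipped == FLIP_ANSWER[original]
    n = len(cases)
    return {"accuracy": correct / n, "consistency": consistent / n}


def always_left_model(scene: Scene, a: str, b: str) -> str:
    """Hypothetical fragile model that ignores the scene entirely."""
    return "left"


# Every case happens to have the cup left of the lamp.
cases = [
    ({"cup": 0.2, "lamp": 0.8}, "cup", "lamp"),
    ({"cup": 0.1, "lamp": 0.4}, "cup", "lamp"),
    ({"cup": 0.6, "lamp": 0.9}, "cup", "lamp"),
]

fragile_report = evaluate(always_left_model, cases)
oracle_report = evaluate(ground_truth, cases)
print(fragile_report)
print(oracle_report)
```

On these cases the fragile model is right on every original input yet never consistent under the flip, while the coordinate-based oracle is both accurate and consistent. Accuracy on the original inputs alone would not distinguish the two.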