New benchmarks test VLM spatial reasoning, robustness, and consistency

By PulseAugur Editorial · [19 sources] · 2026-05-18 10:05

Researchers have developed new benchmarks to evaluate the spatial reasoning capabilities of vision-language models (VLMs). ArchSIBench focuses on architectural space understanding, while Flat-Pack Bench assesses spatio-temporal reasoning in tasks like furniture assembly. SpaceDG addresses robustness by evaluating models under visual degradation, finding that current VLMs struggle with these challenges. Additionally, a framework called SAGE aims to improve spatial reasoning by enforcing geometric logic consistency.

IMPACT These benchmarks and methods aim to push the boundaries of VLM capabilities in understanding complex spatial relationships and real-world visual conditions.

RANK_REASON Multiple research papers introduce new benchmarks and methods for evaluating and improving spatial reasoning in vision-language models.


AI-generated summary · Google Gemini · from 19 sources.

COVERAGE [19]

arXiv cs.CL TIER_1 English(EN) · Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim · 2026-05-27 04:00

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

arXiv:2509.21552v2 Announce Type: replace-cross Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent…
arXiv cs.AI TIER_1 English(EN) · Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka · 2026-05-26 04:00

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

arXiv:2507.07644v4 Announce Type: replace Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes, such as (e.g., kitchens, living rooms, be…
arXiv cs.CL TIER_1 English(EN) · Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang · 2026-05-26 04:00

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

arXiv:2505.23764v3 Announce Type: replace-cross Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-imag…
arXiv cs.MA (Multiagent) TIER_1 English(EN) · Chuang Gan · 2026-05-25 18:04

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized emb…
arXiv cs.CL TIER_1 English(EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang · 2026-05-25 04:00

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv:2505.17015v2 Announce Type: replace-cross Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…
arXiv cs.CL TIER_1 English(EN) · Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong · 2026-05-22 04:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

arXiv:2605.22536v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…
arXiv cs.CL TIER_1 English(EN) · Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan · 2026-05-22 04:00

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

arXiv:2605.21625v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…
arXiv cs.AI TIER_1 English(EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang · 2026-05-22 04:00

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

arXiv:2605.20837v1 Announce Type: cross Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…
arXiv cs.CL TIER_1 English(EN) · Zhihang Zhong · 2026-05-21 14:25

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
arXiv cs.AI TIER_1 English(EN) · Weixin Huang · 2026-05-20 07:27

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-18 16:31

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…
arXiv cs.CV TIER_1 English(EN) · Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue · 2026-05-27 04:00

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

arXiv:2510.09606v2 Announce Type: replace Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This pape…
arXiv cs.CV TIER_1 English(EN) · Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan · 2026-05-27 04:00

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

arXiv:2605.26239v1 Announce Type: new Abstract: In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challen…
arXiv cs.CV TIER_1 English(EN) · Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal · 2026-05-27 04:00

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

arXiv:2605.27310v1 Announce Type: new Abstract: Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an int…
arXiv cs.CV TIER_1 English(EN) · Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha · 2026-05-26 04:00

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

arXiv:2605.25334v1 Announce Type: new Abstract: Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large langua…
arXiv cs.CV TIER_1 English(EN) · Zhenghao Chen, Huiqun Wang, Di Huang · 2026-05-26 04:00

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

arXiv:2604.03318v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by i…
arXiv cs.CV TIER_1 English(EN) · Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong · 2026-05-26 04:00

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

arXiv:2605.25524v1 Announce Type: new Abstract: Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraint…
arXiv cs.CV TIER_1 English(EN) · Ding Wang · 2026-05-18 10:05

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …
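
The idea of "paired transformations with predictable answer mappings" in the abstract above can be illustrated with a toy check. This is a minimal sketch, not the paper's method: it assumes a horizontal flip as the transformation, a left/right swap as the answer mapping, and a scene represented as object names with x positions. The names `flip_scene`, `is_consistent`, `geometric_model`, and `biased_model` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a scene is a list of (object name, x position in [0, 1]).
Scene = List[Tuple[str, float]]
Model = Callable[[Scene, str, str], str]

# Assumption: under a horizontal flip, "left" and "right" swap.
FLIP_ANSWER_MAP: Dict[str, str] = {"left": "right", "right": "left"}


def flip_scene(scene: Scene) -> Scene:
    """Mirror every object horizontally."""
    return [(name, 1.0 - x) for name, x in scene]


def expected_after_flip(answer: str) -> str:
    """Map an answer on the original scene to the answer expected after the flip."""
    return FLIP_ANSWER_MAP.get(answer, answer)


def is_consistent(model: Model, scene: Scene, a: str, b: str) -> bool:
    """Check that the answer on the flipped scene follows the predictable mapping."""
    original = model(scene, a, b)
    flipped = model(flip_scene(scene), a, b)
    return flipped == expected_after_flip(original)


def geometric_model(scene: Scene, a: str, b: str) -> str:
    """Toy model that actually compares positions."""
    positions = dict(scene)
    return "left" if positions[a] < positions[b] else "right"


def biased_model(scene: Scene, a: str, b: str) -> str:
    """Toy model that ignores the scene and always answers 'left'."""
    return "left"


scene: Scene = [("chair", 0.2), ("table", 0.7)]
print(is_consistent(geometric_model, scene, "chair", "table"))
print(is_consistent(biased_model, scene, "chair", "table"))
```

Both toy models answer the original question correctly (the chair is to the left of the table), but only `geometric_model` passes the paired check. That is the gap between instance-level correctness and consistency the abstract describes.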
