New benchmarks test VLM spatial reasoning, robustness, and consistency

作者 PulseAugur 编辑部 · [9 个来源] · 2026-05-18 10:05

Researchers have developed new benchmarks to evaluate the spatial reasoning capabilities of vision-language models (VLMs). ArchSIBench focuses on architectural space understanding, while Flat-Pack Bench assesses spatio-temporal reasoning in tasks like furniture assembly. SpaceDG addresses robustness by evaluating models under visual degradation, finding that current VLMs struggle with these challenges. Additionally, a framework called SAGE aims to improve spatial reasoning by enforcing geometric logic consistency. AI

影响 These benchmarks and methods aim to push the boundaries of VLM capabilities in understanding complex spatial relationships and real-world visual conditions.

排序理由 Multiple research papers introduce new benchmarks and methods for evaluating and improving spatial reasoning in vision-language models.

在 Hugging Face Daily Papers 阅读 →

AI 生成摘要 · Google Gemini · 来自 9 个来源。我们如何撰写摘要 →

报道来源 [9]

arXiv cs.CL TIER_1 English(EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang · 2026-05-25 04:00

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv:2505.17015v2 Announce Type: replace-cross Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…
arXiv cs.CL TIER_1 English(EN) · Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan · 2026-05-22 04:00

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

arXiv:2605.21625v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…
arXiv cs.AI TIER_1 English(EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang · 2026-05-22 04:00

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

arXiv:2605.20837v1 Announce Type: cross Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…
arXiv cs.CL TIER_1 English(EN) · Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong · 2026-05-22 04:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

arXiv:2605.22536v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…
arXiv cs.CL TIER_1 English(EN) · Zhihang Zhong · 2026-05-21 14:25

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
arXiv cs.AI TIER_1 English(EN) · Weixin Huang · 2026-05-20 07:27

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-18 16:31

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…
arXiv cs.CV TIER_1 English(EN) · Ding Wang · 2026-05-18 10:05

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …

报道来源 [9]

相关实体

相关话题