New research tackles VLM spatial reasoning with geometric priors

By PulseAugur Editorial · [35 sources] · 2026-05-18 10:05

Researchers are developing new methods to improve the spatial reasoning capabilities of Vision-Language Models (VLMs), which currently struggle with 3D understanding. Several papers propose injecting geometric priors and structured reasoning into these models. Techniques include using orthographic views, geometric-aware spatial priors (GASP), hierarchical task decomposition, and point cloud data to enhance performance on spatial benchmarks.

IMPACT These advancements aim to equip AI models with more robust 3D spatial understanding, crucial for applications in robotics and embodied intelligence.

RANK_REASON Multiple academic papers proposing new methods and benchmarks for improving AI model capabilities.

AI-generated summary · Google Gemini · from 35 sources.

COVERAGE [35]

arXiv cs.CL TIER_1 English(EN) · Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng · 2026-06-01 04:00

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

arXiv:2603.07751v2 · Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intel…
arXiv cs.AI TIER_1 English(EN) · Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao · 2026-05-29 04:00

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

arXiv:2605.30231v1 · Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating spe…
arXiv cs.AI TIER_1 English(EN) · Fanyi Xiao · 2026-05-28 17:00

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible an…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 17:00

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible an…
arXiv cs.AI TIER_1 English(EN) · Siyi Lyu, Quan Liu, Feng Yan · 2026-05-28 04:00

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

arXiv:2601.03048v2 · Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, this work argues that the limitation arises…
arXiv cs.AI TIER_1 English(EN) · Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie · 2026-05-28 04:00

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

arXiv:2605.28144v1 · LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Insp…
arXiv cs.AI TIER_1 English(EN) · Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao · 2026-05-28 04:00

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

arXiv:2605.28277v1 · Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce Me…
arXiv cs.AI TIER_1 English(EN) · Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou · 2026-05-28 04:00

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

arXiv:2605.28490v1 · 3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style groundin…
arXiv cs.AI TIER_1 English(EN) · Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li · 2026-05-28 04:00

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

arXiv:2504.04540v2 · 3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the advantages of point clouds over other modalities remain u…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Training Vision-Language Models with geometric priors improves 3D spatial reasoning through deep supervision with contrastive loss and depth consistency, achieving better performance than standard fine-tuning approaches.
arXiv cs.CL TIER_1 English(EN) · Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim · 2026-05-27 04:00

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

arXiv:2509.21552v2 · Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent…
arXiv cs.AI TIER_1 English(EN) · Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka · 2026-05-26 04:00

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

arXiv:2507.07644v4 · We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, be…
arXiv cs.CL TIER_1 English(EN) · Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang · 2026-05-26 04:00

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

arXiv:2505.23764v3 · Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-imag…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-26 00:00

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Training intervention called View Dropout combined with panoramic visual thinking enables more effective cross-view spatial reasoning in unified multimodal models.
arXiv cs.MA (Multiagent) TIER_1 English(EN) · Chuang Gan · 2026-05-25 18:04

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized emb…
arXiv cs.CL TIER_1 English(EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang · 2026-05-25 04:00

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv:2505.17015v2 · Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…
arXiv cs.AI TIER_1 English(EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang · 2026-05-22 04:00

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

arXiv:2605.20837v1 · Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…
arXiv cs.CL TIER_1 English(EN) · Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan · 2026-05-22 04:00

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

arXiv:2605.21625v1 · The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…
arXiv cs.CL TIER_1 English(EN) · Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong · 2026-05-22 04:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

arXiv:2605.22536v1 · Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…
arXiv cs.CL TIER_1 English(EN) · Zhihang Zhong · 2026-05-21 14:25

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
arXiv cs.AI TIER_1 English(EN) · Weixin Huang · 2026-05-20 07:27

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-18 16:31

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…
arXiv cs.CV TIER_1 English(EN) · Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong · 2026-05-29 04:00

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

arXiv:2605.29074v1 · Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in emb…
arXiv cs.CV TIER_1 English(EN) · Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin · 2026-05-28 04:00

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

arXiv:2605.28132v1 · Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Visi…
arXiv cs.CV TIER_1 English(EN) · Xiaofang Zhou · 2026-05-27 13:45

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instructio…
arXiv cs.CV TIER_1 English(EN) · Jianwei Yin · 2026-05-27 08:20

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language su…
arXiv cs.CV TIER_1 English(EN) · Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal · 2026-05-27 04:00

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

arXiv:2605.27310v1 · Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an int…
arXiv cs.CV TIER_1 English(EN) · Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan · 2026-05-27 04:00

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

arXiv:2605.26239v1 · In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challen…
arXiv cs.CV TIER_1 English(EN) · Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue · 2026-05-27 04:00

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

arXiv:2510.09606v2 · With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This pape…
arXiv cs.CV TIER_1 English(EN) · Aishwarya Agrawal · 2026-05-26 17:20

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows …
arXiv cs.CV TIER_1 English(EN) · Zhenghao Chen, Huiqun Wang, Di Huang · 2026-05-26 04:00

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

arXiv:2604.03318v2 · Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by i…
arXiv cs.CV TIER_1 English(EN) · Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong · 2026-05-26 04:00

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

arXiv:2605.25524v1 · Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraint…
arXiv cs.CV TIER_1 English(EN) · Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha · 2026-05-26 04:00

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

arXiv:2605.25334v1 · Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large langua…
arXiv cs.CV TIER_1 English(EN) · Ding Wang · 2026-05-18 10:05

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …