SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

New frameworks strengthen VLM spatial reasoning through world models and multi-agent systems

By the PulseAugur editorial team · [7 sources] · 2026-04-23 01:19

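Researchers have developed World2VLM, a novel training framework that distills the spatial reasoning capabilities of generative world models into vision-language models (VLMs). The method synthesizes future views to provide structured supervision, letting VLMs internalize spatial imagination more effectively than approaches that rely on synthetic data or on coupling a world model at inference time. World2VLM shows consistent improvements across a range of spatial reasoning benchmarks and outperforms existing methods.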

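Impact: Introduces new methods and benchmarks for strengthening VLM spatial reasoning, which could improve model performance in dynamic environments.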

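Ranking rationale: This cluster contains several academic papers that introduce new models and benchmarks for spatial reasoning in vision-language models.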

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 7 sources. How we write summaries →

Sources [7]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-29 17:48

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…
arXiv cs.CV TIER_1 English(EN) · Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang · 2026-04-30 04:00

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

arXiv:2604.26934v1 Announce Type: new Abstract: Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts add…
arXiv cs.CV TIER_1 English(EN) · Jiajun Zhang · 2026-04-29 17:48

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…
arXiv cs.CV TIER_1 English(EN) · Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee · 2026-04-29 04:00

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

arXiv:2604.21190v2 Announce Type: replace Abstract: Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such…
arXiv cs.CV TIER_1 English(EN) · Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao · 2026-04-27 04:00

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

arXiv:2604.22409v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric …
arXiv cs.CV TIER_1 English(EN) · Xin Cao · 2026-04-24 10:06

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We intr…
arXiv cs.CV TIER_1 English(EN) · Jungbeom Lee · 2026-04-23 01:19

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric…
