English(EN)SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
新框架通过世界模型和多智能体系统增强VLM的空间推理能力
作者PulseAugur 编辑部·[7 个来源]·
研究人员开发了World2VLM,一个新颖的训练框架,将生成式世界模型中的空间推理能力提炼到视觉语言模型(VLMs)中。该方法合成未来视图以提供结构化监督,使VLMs能够比依赖合成数据或推理时世界模型耦合的方法更有效地内化空间想象。World2VLM在各种空间推理基准测试中表现出持续的改进,优于现有方法。
AI
Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…
arXiv:2604.26934v1 Announce Type: new Abstract: Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts add…
Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…
arXiv cs.CV
TIER_1English(EN)·Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee·
arXiv:2604.21190v2 Announce Type: replace Abstract: Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such…
arXiv:2604.22409v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric …
Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We intr…
Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric…