PulseAugur
English(EN) SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

New frameworks enhance VLM spatial reasoning through world models and multi-agent systems

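Researchers have developed World2VLM, a novel training framework that distills spatial reasoning capabilities from generative world models into vision-language models (VLMs). The method synthesizes future views to provide structured supervision, allowing VLMs to internalize spatial imagination more effectively than approaches that rely on synthetic data or on coupling with a world model at inference time. World2VLM shows consistent improvements across a range of spatial reasoning benchmarks and outperforms existing methods.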

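Impact: Introduces new methods and benchmarks for strengthening VLM spatial reasoning, which could improve their performance in dynamic environments.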

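Ranking rationale: This cluster contains several academic papers that introduce new models and benchmarks for spatial reasoning in vision-language models.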


AI-generated summary · Google Gemini · from 7 sources.


Sources [7]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…

  2. arXiv cs.CV TIER_1 English(EN) · Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang ·

    World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    arXiv:2604.26934v1 Announce Type: new Abstract: Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts add…

  3. arXiv cs.CV TIER_1 English(EN) · Jiajun Zhang ·

    World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial s…

  4. arXiv cs.CV TIER_1 English(EN) · Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee ·

    SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

    arXiv:2604.21190v2 Announce Type: replace Abstract: Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such…

  5. arXiv cs.CV TIER_1 English(EN) · Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao ·

    SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    arXiv:2604.22409v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric …

  6. arXiv cs.CV TIER_1 English(EN) · Xin Cao ·

    SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We intr…

  7. arXiv cs.CV TIER_1 English(EN) · Jungbeom Lee ·

    SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

    Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric…