PulseAugur

New benchmarks evaluate the spatial reasoning, robustness, and consistency of vision-language models

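Researchers have developed new benchmarks to evaluate the spatial reasoning capabilities of vision-language models (VLMs). ArchSIBench focuses on architectural spatial understanding, while Flat-Pack Bench evaluates spatio-temporal reasoning in tasks such as furniture assembly. SpaceDG addresses robustness by evaluating models under visual degradation and finds that current VLMs struggle under these conditions. In addition, a framework named SAGE aims to improve spatial reasoning by enforcing geometric-logical consistency.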

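Impact: These benchmarks and methods aim to push the limits of what vision-language models can do in understanding complex spatial relationships and real-world visual conditions.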

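Ranking rationale: Multiple research papers introduce new benchmarks and methods for evaluating and improving the spatial reasoning capabilities of vision-language models.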

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · From 9 sources. How we write summaries →


Sources [9]

  1. arXiv cs.CL TIER_1 English(EN) · Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang ·

    Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

    arXiv:2505.17015v2 Announce Type: replace-cross Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-…

  2. arXiv cs.CL TIER_1 English(EN) · Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan ·

    Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

    arXiv:2605.21625v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classificatio…

  3. arXiv cs.AI TIER_1 English(EN) · Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang ·

    ArchSIBench: A Benchmark for Evaluating the Architectural Spatial Intelligence of Vision-Language Models

    arXiv:2605.20837v1 Announce Type: cross Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive rese…

  4. arXiv cs.CL TIER_1 English(EN) · Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong ·

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    arXiv:2605.22536v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-w…

  5. arXiv cs.CL TIER_1 English(EN) · Zhihang Zhong ·

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, a…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    SpaceDG dataset and benchmark evaluate multimodal language models' spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.

  7. arXiv cs.AI TIER_1 English(EN) · Weixin Huang ·

    ArchSIBench: A Benchmark for Evaluating the Architectural Spatial Intelligence of Vision-Language Models

    Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vis…

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

    Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by thre…

  9. arXiv cs.CV TIER_1 English(EN) · Ding Wang ·

    Self-Evolving Spatial Reasoning in Vision-Language Models via Geometric-Logical Consistency

    Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness …