PulseAugur
English(EN) From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

New Benchmarks Challenge the Spatial and Functional Reasoning of Multimodal Large Language Models (MLLMs)

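Researchers have introduced new benchmarks for evaluating the spatial and functional reasoning of multimodal large language models (MLLMs). The benchmarks aim to go beyond basic geometric perception and assess higher-level cognitive abilities, such as structured spatial reasoning and understanding an object's utility in a specific context. Experiments show that current MLLMs struggle to integrate spatial memory, functional reasoning, and external knowledge, which points to a major bottleneck on the path to embodied intelligence.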

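Impact: The new benchmarks will drive the development of more cognitively capable multimodal agents and improve their ability to interact with and plan in the real world.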

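Ranking rationale: Multiple arXiv papers introduce new benchmarks and models for evaluating the spatial and functional intelligence of multimodal large language models.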


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. Apple Machine Learning Research TIER_1 English(EN) ·

    From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs

    True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall sh…

  2. arXiv cs.AI TIER_1 English(EN) · Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang ·

    SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

    arXiv:2511.21471v4 Announce Type: replace Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existin…

  3. arXiv cs.LG TIER_1 English(EN) · Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan ·

    Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    arXiv:2605.04128v1 Announce Type: cross Abstract: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM…

  4. arXiv cs.CV TIER_1 English(EN) · Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng ·

    From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    arXiv:2605.02130v1 Announce Type: new Abstract: Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception c…