From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
New Benchmarks Challenge the Spatial and Functional Reasoning Abilities of Multimodal Large Language Models (MLLMs)
By the PulseAugur editorial team · [4 sources]
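Researchers have introduced new benchmarks for evaluating the spatial and functional reasoning abilities of multimodal large language models (MLLMs). The benchmarks aim to go beyond basic geometric perception and assess higher-level cognitive abilities, such as structured spatial reasoning and understanding an object's utility in a specific context. Experiments show that current MLLMs struggle to integrate spatial memory, functional reasoning, and external knowledge, which highlights a major bottleneck on the path to embodied intelligence.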
True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall sh…
arXiv cs.AI
TIER_1 · English (EN) · Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang
arXiv:2511.21471v4 Announce Type: replace Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existin…
arXiv cs.LG
TIER_1 · English (EN) · Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan
arXiv:2605.04128v1 Announce Type: cross Abstract: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM…
arXiv:2605.02130v1 Announce Type: new Abstract: Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception c…