PulseAugur
MLLMs show foundational visual gaps despite progress in multimodal reasoning

A new paper introduces a method to improve latent reasoning in multimodal large language models (MLLMs) by optimizing visual latents at inference time, addressing a pathology in which the latents' contribution is suppressed. A separate study uses a new benchmark, VisFactor, to reveal significant foundational visual gaps in current MLLMs, including frontier models such as GPT and Gemini. The benchmark, based on human cognitive psychology assessments, highlights consistent failures on tasks such as spatial relation inference and figure-ground discrimination, suggesting that current MLLM performance may not reflect true visual cognition.

Impact: Highlights critical visual reasoning deficits in MLLMs, suggesting that current benchmarks may overstate capabilities and that more robust evaluation methods are needed.

Ranking rationale: Two arXiv papers present novel research on multimodal large language models, one proposing a new optimization technique and the other introducing a new benchmark for evaluating visual cognition.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.CL TIER_1 English(EN) · Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

    What MLLMs Learn about When they Learn about Multimodal Reasoning

    arXiv:2510.01719v4 · Abstract: Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that expose…

  2. arXiv cs.LG TIER_1 English(EN) · Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    arXiv:2605.02735v1 · Abstract: Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a p…

  3. arXiv cs.LG TIER_1 English(EN) · Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in e…

  4. arXiv cs.CV TIER_1 English(EN) · Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan

    Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

    arXiv:2502.16435v4 · Abstract: Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstr…