PulseAugur

MLLMs show foundational visual gaps despite progress in multimodal reasoning

A new paper introduces a method for improving latent reasoning in multimodal large language models (MLLMs) by optimizing visual latents at inference time, addressing an optimization pathology in which the visual latents' contribution to the answer is suppressed. Separately, another study uses a new benchmark, VisFactor, to reveal significant foundational visual gaps in current MLLMs, including frontier models such as GPT and Gemini. The benchmark, built from human cognitive psychology assessments, exposes consistent failures on tasks like spatial relation inference and figure-ground discrimination, suggesting that current MLLM benchmark scores may not reflect true visual cognition.

Summary written by gemini-2.5-flash-lite from 4 sources.
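For intuition, here is a minimal sketch of the inference-time idea described in the summary: treat the visual latents as free parameters of a frozen model and nudge them with a few gradient steps before decoding. The toy model, the entropy-minimization objective, and all names below are illustrative assumptions, not the paper's released method.

```python
# Hedged sketch of inference-time visual-latent optimization (assumed, toy
# version): keep the model frozen and take a few gradient steps on the visual
# latent against a proxy objective (here, answer-entropy minimization).
import torch
import torch.nn.functional as F

class ToyMLLMHead(torch.nn.Module):
    """Stand-in for a frozen MLLM: maps [text; visual] latents to answer logits."""
    def __init__(self, dim=64, vocab=10):
        super().__init__()
        self.proj = torch.nn.Linear(2 * dim, vocab)

    def forward(self, text_latent, visual_latent):
        return self.proj(torch.cat([text_latent, visual_latent], dim=-1))

def refine_visual_latent(model, text_latent, visual_latent, steps=8, lr=1e-2):
    """Optimize the visual latent at inference time so it contributes more
    sharply to the answer distribution (entropy is an assumed proxy)."""
    z = visual_latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        logits = model(text_latent, z)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()  # gradient flows only into z; the model is frozen
        opt.step()
    return z.detach()

model = ToyMLLMHead().eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the model; only the latent moves
text, visual = torch.randn(1, 64), torch.randn(1, 64)
refined = refine_visual_latent(model, text, visual)
print(model(text, refined).argmax(-1))  # decode with the refined latent
```

Freezing the model and optimizing only the latent keeps the procedure training-free, which is the usual appeal of inference-time approaches.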

IMPACT Highlights critical visual reasoning deficits in MLLMs, suggesting current benchmarks may overstate capabilities and prompting a need for more robust evaluation methods.

RANK_REASON Two arXiv papers present novel research on multimodal large language models, one proposing a new optimization technique and the other introducing a new benchmark for evaluating visual cognition.

Read on arXiv cs.CV →

COVERAGE [4]

  1. arXiv cs.CL TIER_1 · Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

    What MLLMs Learn about When they Learn about Multimodal Reasoning

    arXiv:2510.01719v4 Announce Type: replace Abstract: Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that expose…

  2. arXiv cs.LG TIER_1 · Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    arXiv:2605.02735v1 Announce Type: new Abstract: Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a p…

  3. arXiv cs.LG TIER_1 · Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in e…

  4. arXiv cs.CV TIER_1 · Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan

    Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

    arXiv:2502.16435v4 Announce Type: replace Abstract: Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstr…
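A common thread in the MathLens and VisFactor abstracts above is scoring by sub-skill rather than by a single aggregate accuracy, so that a high overall number cannot hide a failure on, say, figure-ground discrimination. A minimal sketch of that idea follows; the skill names and data layout are assumed for illustration, not taken from either paper's released harness.

```python
# Hedged sketch of per-skill benchmark scoring (assumed structure): report
# accuracy per cognitive factor instead of one aggregate score.
from collections import defaultdict

def per_skill_accuracy(results):
    """results: iterable of (skill_name, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for skill, correct in results:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}

# Hypothetical model outputs graded against ground truth:
results = [
    ("spatial_relations", True), ("spatial_relations", False),
    ("figure_ground", False), ("figure_ground", False),
    ("visual_closure", True),
]
print(per_skill_accuracy(results))
# {'spatial_relations': 0.5, 'figure_ground': 0.0, 'visual_closure': 1.0}
```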