PulseAugur

MLLMs show foundational visual gaps despite progress in multimodal reasoning

A new paper introduces a method for improving latent reasoning in multimodal large language models (MLLMs) by optimizing visual latents at inference time, addressing an optimization pathology in which the visual latents' contribution to the answer is suppressed. Separately, another study uses a new benchmark, VisFactor, to reveal significant foundational visual gaps in current MLLMs, including frontier models such as GPT and Gemini. The benchmark, built from human cognitive psychology assessments, exposes consistent failures on tasks like spatial relation inference and figure-ground discrimination, suggesting that current MLLM benchmark scores may not reflect true visual cognition.

Summary written by gemini-2.5-flash-lite from 4 sources.
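For intuition, here is a minimal sketch of the inference-time idea described in the summary: treat the visual latents as free parameters of a frozen model and nudge them with a few gradient steps before decoding. The toy model, the entropy-minimization objective, and all names below are illustrative assumptions, not the paper's released method.

```python
# Hedged sketch of inference-time visual-latent optimization (assumed, toy
# version): keep the model frozen and take a few gradient steps on the visual
# latent against a proxy objective (here, answer-entropy minimization).
import torch
import torch.nn.functional as F

class ToyMLLMHead(torch.nn.Module):
    """Stand-in for a frozen MLLM: maps [text; visual] latents to answer logits."""
    def __init__(self, dim=64, vocab=10):
        super().__init__()
        self.proj = torch.nn.Linear(2 * dim, vocab)

    def forward(self, text_latent, visual_latent):
        return self.proj(torch.cat([text_latent, visual_latent], dim=-1))

def refine_visual_latent(model, text_latent, visual_latent, steps=8, lr=1e-2):
    """Optimize the visual latent at inference time so it contributes more
    sharply to the answer distribution (entropy is an assumed proxy)."""
    z = visual_latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        logits = model(text_latent, z)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()  # gradient flows only into z; the model is frozen
        opt.step()
    return z.detach()

model = ToyMLLMHead().eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the model; only the latent moves
text, visual = torch.randn(1, 64), torch.randn(1, 64)
refined = refine_visual_latent(model, text, visual)
print(model(text, refined).argmax(-1))  # decode with the refined latent
```

Freezing the model and optimizing only the latent keeps the procedure training-free, which is the usual appeal of inference-time approaches.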

IMPACT Highlights critical visual reasoning deficits in MLLMs, suggesting current benchmarks may overstate capabilities and prompting a need for more robust evaluation methods.

RANK_REASON Two arXiv papers present novel research on multimodal large language models, one proposing a new optimization technique and the other introducing a new benchmark for evaluating visual cognition.

Read on arXiv cs.CV →

COVERAGE [4]

  1. arXiv cs.CL TIER_1 · Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

    What MLLMs Learn about When they Learn about Multimodal Reasoning

    arXiv:2510.01719v4 Announce Type: replace Abstract: Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that expose…

  2. arXiv cs.LG TIER_1 · Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    arXiv:2605.02735v1 Announce Type: new Abstract: Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a p…

  3. arXiv cs.LG TIER_1 · Joey Tianyi Zhou

    Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in e…

  4. arXiv cs.CV TIER_1 · Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan

    Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

    arXiv:2502.16435v4 Announce Type: replace Abstract: Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstr…
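A common thread in the MathLens and VisFactor abstracts above is scoring by sub-skill rather than by a single aggregate accuracy, so that a high overall number cannot hide a failure on, say, figure-ground discrimination. A minimal sketch of that idea follows; the skill names and data layout are assumed for illustration, not taken from either paper's released harness.

```python
# Hedged sketch of per-skill benchmark scoring (assumed structure): report
# accuracy per cognitive factor instead of one aggregate score.
from collections import defaultdict

def per_skill_accuracy(results):
    """results: iterable of (skill_name, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for skill, correct in results:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}

# Hypothetical model outputs graded against ground truth:
results = [
    ("spatial_relations", True), ("spatial_relations", False),
    ("figure_ground", False), ("figure_ground", False),
    ("visual_closure", True),
]
print(per_skill_accuracy(results))
# {'spatial_relations': 0.5, 'figure_ground': 0.0, 'visual_closure': 1.0}
```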