A new paper introduces a method to improve latent reasoning in multimodal large language models (MLLMs) by optimizing visual latents at inference time, addressing a pathology in which the latents' contribution is suppressed. Separately, another study uses a new benchmark, VisFactor, to reveal significant gaps in foundational visual abilities in current MLLMs, including frontier models such as GPT and Gemini. The benchmark, based on human cognitive psychology assessments, shows consistent failures on tasks such as spatial relation inference and figure-ground discrimination, suggesting that current MLLM performance may not reflect true visual cognition.
Impact: Highlights critical visual reasoning deficits in MLLMs, suggesting that current benchmarks may overstate capabilities and pointing to a need for more robust evaluation methods.
Ranking rationale: Two arXiv papers present novel research on multimodal large language models: one proposes a new optimization technique, and the other introduces a new benchmark for evaluating visual cognition.
AI-generated summary · Google Gemini · from 4 sources.