Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Researchers have identified a phenomenon called attention dispersion in multimodal large language models (MLLMs): during complex reasoning, the model's visual attention scatters away from the relevant image regions, which impairs its reasoning, particularly on visual question answering tasks. To address this, they propose Visual Region-Guided Attention (VRGA), a training-free framework that reweights attention to keep the model focused on the crucial visual elements.
AI IMPACT: Mitigates a key limitation of multimodal LLMs, potentially improving their reliability in visual reasoning tasks.
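
The summary does not give VRGA's actual formulation, so the following is only a minimal sketch of what region-guided attention reweighting could look like in general. It assumes a single attention distribution over tokens, a boolean mask marking the tokens in the relevant visual region, and a multiplicative boost followed by renormalization. The function name `reweight_attention`, the `boost` parameter, and the toy values are hypothetical and are not taken from the paper.

```python
def reweight_attention(weights, region_mask, boost=2.0):
    """Scale attention on masked (relevant) tokens by `boost`, then renormalize.

    Hypothetical illustration only; not the VRGA method from the paper.
    """
    if len(weights) != len(region_mask):
        raise ValueError("weights and region_mask must have the same length")
    # Boost the weights of tokens inside the relevant region.
    scaled = [w * boost if m else w for w, m in zip(weights, region_mask)]
    total = sum(scaled)
    if total == 0:
        # Nothing to renormalize; return the input unchanged.
        return list(weights)
    # Renormalize so the weights again sum to 1.
    return [s / total for s in scaled]


# Toy example: four tokens, the middle two lie in the relevant region.
attn = [0.1, 0.2, 0.3, 0.4]
mask = [False, True, True, False]
focused = reweight_attention(attn, mask, boost=2.0)

mass_before = sum(w for w, m in zip(attn, mask) if m)
mass_after = sum(w for w, m in zip(focused, mask) if m)
print(f"attention on region: {mass_before:.3f} -> {mass_after:.3f}")
```

Because this adjusts attention weights at inference time and touches no model parameters, it is training-free in the same sense the summary describes, although how VRGA selects regions and sets the reweighting strength is not stated here.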