Brief · PulseAugur

TOOL · arXiv cs.CV

Position: The Systemic Lack of Agency in Visual Reasoning

A new position paper argues that current vision-language models (VLMs) suffer from a systemic lack of agency that hinders their implicit reasoning. The authors contend that VLMs tend to perform passive semantic retrieval rather than the active, situated reasoning that is central to human visual understanding. To measure this gap, they introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD) and find that even prominent VLMs struggle with autonomous visual exploration and self-directed inquiry.

IMPACT Highlights a critical gap in current VLMs, potentially guiding future research towards more autonomous and exploratory AI systems.

arXiv
Vision-Language Models
Visual Implicit Reasoning Diagnosing Benchmark