New training method combats 'lazy perception' in vision-language models

By PulseAugur Editorial · [1 source] · 2026-05-18 16:13

Researchers have introduced a training paradigm called "Starve to Perceive" to address "lazy perception" in Vision-Language Models (VLMs). Lazy perception arises when a VLM can reach adequate accuracy from coarse visual inputs and language priors alone, so it has little incentive to learn active visual search strategies such as zooming or cropping. "Starve to Perceive" constrains the visual bandwidth by limiting each observation to a small token budget, which forces the model to perceive actively in order to complete the task. This minimal, plug-in modification to existing training pipelines yielded an average relative improvement of 5% across various benchmarks, without architectural changes or auxiliary losses.
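To make the token-budget idea concrete, here is a minimal sketch of how a per-observation budget can make zooming worthwhile. It is an illustration under stated assumptions, not the paper's implementation: the 14-pixel patch size, the 256-token budget, and the helper names `tokens_for`, `fit_to_budget`, and `observe` are all hypothetical, and the sketch assumes each square patch becomes one visual token.

```python
import math

PATCH = 14  # assumed patch size in pixels (hypothetical, not from the paper)


def tokens_for(width, height, patch=PATCH):
    """Count visual tokens if every patch x patch tile becomes one token."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_budget(width, height, budget, patch=PATCH):
    """Return the (width, height) a view is downsampled to so tokens <= budget."""
    total = tokens_for(width, height, patch)
    if total <= budget:
        return width, height
    scale = math.sqrt(budget / total)
    # Round down to whole patches, never below a single patch.
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)
    # Safety net: trim the longer side if rounding still exceeds the budget.
    while tokens_for(w, h, patch) > budget and (w > patch or h > patch):
        if w >= h and w > patch:
            w -= patch
        else:
            h -= patch
    return w, h


def observe(view, budget, patch=PATCH):
    """Simulate one observation of a crop given as (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = view
    crop_w, crop_h = x1 - x0, y1 - y0
    w, h = fit_to_budget(crop_w, crop_h, budget, patch)
    # "scale" is the fraction of the crop's original resolution that survives.
    return {"tokens": tokens_for(w, h, patch), "scale": w / crop_w}


BUDGET = 256  # assumed per-observation token budget (hypothetical)
full_view = observe((0, 0, 4032, 3024), BUDGET)  # whole high-resolution image
zoomed_view = observe((0, 0, 448, 336), BUDGET)  # small crop of the same image
print(full_view, zoomed_view)
```

Under these assumptions, the full image and the small crop consume a similar number of tokens, but the crop keeps far more of its original resolution. That is the incentive the summary describes: with a tight budget, fine detail is only available to a model that chooses where to look.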

IMPACT This research introduces a method to improve the active perception capabilities of VLMs, potentially leading to more effective agents in complex visual environments.

RANK_REASON The cluster contains an academic paper detailing a new training methodology for existing models.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Haozhe Wang · 2026-05-18 16:13

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models…
