New training method combats 'lazy perception' in vision-language models

By the PulseAugur editorial team · [1 source] · 2026-05-18 16:13

Researchers have introduced a training paradigm called "Starve to Perceive" to address "lazy perception" in Vision-Language Models (VLMs). Lazy perception arises when a VLM can reach adequate accuracy from coarse visual inputs and language priors alone, which leaves it with no real incentive to learn active visual search strategies such as zooming or cropping. "Starve to Perceive" constrains visual bandwidth by limiting each observation to a small token budget, so the model must perceive actively to complete the task. This minimal, plug-in modification to existing training pipelines yielded an average relative improvement of 5% across various benchmarks, without architectural changes or auxiliary losses.
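To make the token-budget idea concrete, here is a minimal sketch of how a per-observation budget limits what a model can see at once. It is not the paper's implementation: the function names, the 14-pixel patch size, and the 256-token budget are illustrative assumptions, since the summary gives none of these details.

```python
import math

def fit_to_token_budget(width, height, patch=14, budget=256):
    """Shrink a view so its patch-token count fits the budget.

    Hypothetical helper: patch=14 and budget=256 are assumed values,
    not figures taken from the paper.
    Returns (new_width, new_height, tokens).
    """
    def tokens(w, h):
        return math.ceil(w / patch) * math.ceil(h / patch)

    # A view that already fits the budget is left untouched.
    if tokens(width, height) <= budget:
        return width, height, tokens(width, height)

    # Scale both sides equally, then snap down to whole patches.
    scale = math.sqrt(budget * patch * patch / (width * height))
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)

    # Guard for extreme aspect ratios, where the clamp to one patch
    # could otherwise push the count above the budget.
    while tokens(new_w, new_h) > budget and max(new_w, new_h) > patch:
        if new_w >= new_h:
            new_w -= patch
        else:
            new_h -= patch
    return new_w, new_h, tokens(new_w, new_h)

def source_pixels_per_token(source_width, new_width, patch=14):
    """Source pixels (per side) that a single token has to summarize."""
    return source_width / (new_width / patch)

# Whole 4096x4096 scene in one observation: very coarse.
full = fit_to_token_budget(4096, 4096)
# A 448x448 crop of the same scene under the same budget: much finer.
crop = fit_to_token_budget(448, 448)

print(full, source_pixels_per_token(4096, full[0]))
print(crop, source_pixels_per_token(448, crop[0]))
```

Under these assumed numbers, both observations cost the same number of tokens, but each token in the full view has to summarize a far larger patch of the scene than a token in the crop. That gap is the pressure the summary describes: when fine detail matters, the model has to choose where to zoom or crop rather than rely on one coarse glance.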

Impact: This research introduces a method to improve the active perception capabilities of VLMs, which could lead to more effective agents in complex visual environments.

Ranking rationale: The cluster contains an academic paper detailing a new training methodology for existing models.

AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.CV TIER_1 English(EN) · Haozhe Wang · 2026-05-18 16:13

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models…
