PulseAugur

New training method combats 'lazy perception' in vision-language models

Researchers have introduced a training paradigm called "Starve to Perceive" to address "lazy perception" in Vision-Language Models (VLMs). Lazy perception arises when a VLM can reach adequate accuracy from coarse visual input and language priors alone, so it has no real incentive to learn active visual search strategies such as zooming or cropping. "Starve to Perceive" constrains the visual bandwidth by limiting each observation to a small token budget, which forces the model to perceive actively in order to complete the task. This minimal, plug-in modification to existing training pipelines yielded an average relative improvement of 5% across various benchmarks, without architectural changes or auxiliary losses.
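To make the idea of a per-observation token budget concrete, here is a minimal sketch. It is not the paper's implementation: the patch size, the function names `tokens_for` and `constrain_observation`, and the rule of downscaling a view until it fits the budget are all assumptions chosen for illustration.

```python
import math

PATCH = 14  # assumed ViT-style patch size in pixels (hypothetical value)


def tokens_for(width, height, patch=PATCH):
    """Visual tokens needed if each patch x patch tile becomes one token."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def constrain_observation(width, height, budget, patch=PATCH):
    """Return the (width, height) a view is downscaled to so it fits `budget` tokens.

    Hypothetical helper: views already within budget are returned unchanged;
    larger views are shrunk with (roughly) preserved aspect ratio.
    """
    if tokens_for(width, height, patch) <= budget:
        return width, height
    # Scale factor at which the token count would equal the budget.
    scale = math.sqrt(budget * patch * patch / (width * height))
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)
    # Guard for extreme aspect ratios: trim the longer side until it fits.
    while tokens_for(w, h, patch) > budget and max(w, h) > patch:
        if w >= h:
            w -= patch
        else:
            h -= patch
    return w, h


# The full scene and a small crop both cost at most 256 tokens,
# but the crop keeps far more detail per source pixel.
full_w, full_h = constrain_observation(4096, 4096, budget=256)
crop_w, crop_h = constrain_observation(448, 448, budget=256)
print("full view:", (full_w, full_h), "scale", full_w / 4096)
print("crop view:", (crop_w, crop_h), "scale", crop_w / 448)
```

Under these assumptions, a single glance at the whole high-resolution scene is heavily downsampled, while a crop of a small region is seen at much higher effective resolution for the same token cost. That is the kind of pressure that would make zooming or cropping necessary rather than optional.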

Impact: This research introduces a method to improve the active perception capabilities of VLMs, potentially leading to more effective agents in complex visual environments.

Ranking rationale: The cluster contains an academic paper detailing a new training methodology for existing models.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.CV · TIER_1 · English (EN) · Haozhe Wang

    Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

    Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models…