PulseAugur

V-Zero framework enables label-free visual reasoning, boosting training speed

Researchers have introduced V-Zero, a framework for fine-grained visual reasoning that trains without annotated answer labels. The method uses contrastive evidence gating to help the model identify task-relevant visual evidence and ground its reasoning in specific image regions. By pairing question-relevant crops with negative visual views to evaluate and gate distillation, V-Zero reportedly trains more than 5 times faster than supervised fine-tuning and more than 10 times faster than reinforcement learning baselines.

IMPACT This label-free approach could significantly reduce the cost and time associated with training visual reasoning models.



AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. Hugging Face Daily Papers TIER_1 English(EN)

    V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

    V-Zero is a framework for fine-grained visual reasoning that requires no annotated answer labels. It uses contrastive evidence gating to improve reasoning and trains faster than supervised fine-tuning and reinforcement learning baselines.

  2. arXiv cs.CV TIER_1 English(EN) · Haoxiang Sun, Zhihang Yi, Langxuan Deng, Yuhao Zhou, Peiqi Jia, Jian Zhao, Li Yuan, Jiancheng Lv, Tao Wang

    V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

    arXiv:2606.25319v1 Announce Type: new Abstract: Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement le…