CompART training improves VLM multi-object grounding and visual understanding

By PulseAugur Editorial Team · [1 source] · 2026-05-08 04:00

Researchers have developed a training method called Compositional Attention-Regularized Training (CompART) to improve how Vision-Language Models (VLMs) handle complex, multi-object references. Current VLMs ground poorly when a phrase refers to multiple objects, largely because their training objectives focus on image-caption alignment. CompART addresses this by decomposing captions into object-centric phrases and constructing composite phrases from them, then encouraging the model's attention to balance across these components for better localization.
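The summary does not state the actual CompART objective, so the following is only a toy sketch of what "balancing attention across components" could mean. It assumes a penalty that compares the attention map of a composite phrase with the equal-weight average of its component phrases' maps. The function names, the squared-error form, and the 4-cell "image" are all hypothetical and are not taken from the paper.

```python
def normalize(attn):
    """Scale a non-negative attention map so its entries sum to 1."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    return [a / total for a in attn]


def balance_penalty(composite_attn, component_attns):
    """Hypothetical regularizer (an assumption, not the paper's loss):
    squared distance between the composite phrase's attention and the
    equal-weight average of its component phrases' attention maps."""
    normalized = [normalize(a) for a in component_attns]
    n = len(normalized)
    # Cell-by-cell average of the component maps.
    target = [sum(cells) / n for cells in zip(*normalized)]
    composite = normalize(composite_attn)
    return sum((c - t) ** 2 for c, t in zip(composite, target))


# Toy 4-cell image: "the cat" sits in cell 0, "the dog" in cell 3.
cat = [1.0, 0.0, 0.0, 0.0]
dog = [0.0, 0.0, 0.0, 1.0]

balanced = [0.5, 0.0, 0.0, 0.5]   # "the cat and the dog" covers both objects
collapsed = [1.0, 0.0, 0.0, 0.0]  # composite attention ignores the dog

print(balance_penalty(balanced, [cat, dog]))   # 0.0
print(balance_penalty(collapsed, [cat, dog]))  # 0.5
```

In this toy setup the penalty is zero when the composite phrase attends equally to both objects and grows when attention collapses onto one of them, which is the failure mode the summary describes for multi-object phrases.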

Impact: Introduces a training technique that improves a VLM's ability to understand and localize multiple objects within complex visual references.

Ranking rationale: This is a research paper detailing a new training methodology for existing models.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal · 2026-05-08 04:00

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

arXiv:2412.08110v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that ground…

