PulseAugur

OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding

Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was built with a pipeline that identifies and collects videos containing underrepresented concepts, paired with a caption-centric approach for high-quality annotation. The authors also propose a Self-Correction Chain-of-Thought (CoT) training method that uses the MLLM's own understanding capabilities to refine its predictions. They report state-of-the-art performance on existing benchmarks and on the new OmniVTG dataset.

Impact: New datasets and training paradigms may improve the ability of multimodal models to accurately localize video segments based on text queries.

Ranking rationale: This cluster contains two academic papers detailing new datasets and training methodologies for video temporal grounding.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv cs.CV TIER_1 English(EN) · Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    arXiv:2604.25276v1. Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts…

  2. arXiv cs.CV TIER_1 English(EN) · Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce Om…

  3. arXiv cs.CV TIER_1 English(EN) · Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

    Multi-Scale Contrastive Learning for Video Temporal Grounding

    arXiv:2412.07157v3. Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a mu…