PulseAugur

OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding

Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was built with a novel pipeline that identifies and collects videos covering underrepresented concepts, combined with a caption-centric approach for high-quality annotation. The authors also propose a Self-Correction Chain-of-Thought (CoT) training method, which leverages MLLMs' understanding capabilities to refine predictions. The method is reported to achieve state-of-the-art performance on existing benchmarks and on the new OmniVTG dataset.

IMPACT New datasets and training paradigms may improve the ability of multimodal models to accurately localize video segments based on text queries.

RANK_REASON This cluster contains two academic papers detailing new datasets and training methodologies for video temporal grounding.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.CV TIER_1 English(EN) · Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    arXiv:2604.25276v1 (announce type: new). Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts…

  2. arXiv cs.CV TIER_1 English(EN) · Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce Om…

  3. arXiv cs.CV TIER_1 English(EN) · Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

    Multi-Scale Contrastive Learning for Video Temporal Grounding

    arXiv:2412.07157v3 (announce type: replace). Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a mu…