OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding

By PulseAugur Editorial · [3 sources] · 2026-04-28 04:00

Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was built with a novel pipeline that identifies and collects videos covering underrepresented concepts, combined with a caption-centric approach for high-quality annotation. The authors also propose a Self-Correction Chain-of-Thought (CoT) training method that leverages MLLMs' understanding capabilities to refine predictions. The approach is reported to achieve state-of-the-art performance on existing benchmarks and on the new OmniVTG dataset.

IMPACT New datasets and training paradigms may improve the ability of multimodal models to accurately localize video segments based on text queries.

RANK_REASON This cluster contains two academic papers detailing new datasets and training methodologies for video temporal grounding.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.CV TIER_1 English(EN) · Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu · 2026-04-29 04:00

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

arXiv:2604.25276v1 Announce Type: new Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts…

arXiv cs.CV TIER_1 English(EN) · Yang Liu · 2026-04-28 06:34

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce Om…

arXiv cs.CV TIER_1 English(EN) · Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu · 2026-04-28 04:00

Multi-Scale Contrastive Learning for Video Temporal Grounding

arXiv:2412.07157v3 Announce Type: replace Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a mu…
