Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was built with a novel pipeline that identifies and collects videos containing underrepresented concepts, paired with a caption-centric approach for high-quality annotation. The authors also propose a Self-Correction Chain-of-Thought (CoT) training method that leverages MLLMs' understanding capabilities to refine their own predictions, achieving state-of-the-art performance on existing benchmarks and the new OmniVTG dataset.
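At a high level, a self-correction loop of this kind could look like the sketch below. This is a minimal illustration of the general technique, not the paper's implementation: the `GroundingModel` interface, the method names (`ground`, `describe`, `matches`, `refine`), and the round limit are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Span:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


class GroundingModel(Protocol):
    """Interface an MLLM wrapper might expose; names are illustrative."""
    def ground(self, video: str, query: str) -> Span: ...
    def describe(self, video: str, span: Span) -> str: ...
    def matches(self, caption: str, query: str) -> bool: ...
    def refine(self, video: str, query: str, span: Span, caption: str) -> Span: ...


def self_correction_cot(model: GroundingModel, video: str, query: str,
                        max_rounds: int = 2) -> Span:
    # Initial prediction: localize the text query in the video.
    span = model.ground(video, query)
    for _ in range(max_rounds):
        # Self-check: have the same MLLM describe the predicted segment.
        caption = model.describe(video, span)
        if model.matches(caption, query):
            break  # the prediction is consistent with the query; stop
        # Otherwise, revise the span using the mismatch as feedback.
        span = model.refine(video, query, span, caption)
    return span
```

The key idea the sketch captures is that the model's understanding ability (describing a segment) is used to verify and correct its grounding ability (localizing the segment), rather than accepting the first prediction.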
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT New datasets and training paradigms may improve the ability of multimodal models to accurately localize video segments based on text queries.
RANK_REASON This cluster contains two academic papers detailing new datasets and training methodologies for video temporal grounding.