Researchers have introduced a new benchmark and dataset for One-to-Many Temporal Grounding (OMTG), a task that involves localizing all of the video segments that correspond to a single text query. Existing multimodal large language models (MLLMs) struggle with OMTG because they lack event cardinality perception. The proposed solution adds novel temporal and caption reward functions and uses Chain-of-Thought reasoning to improve precision and completeness. Experiments report a new state-of-the-art Effective Temporal F1 score of 43.65%, significantly outperforming models such as Gemini 2.5 Pro and Seed-1.8.
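One way to read precision and completeness in a one-to-many setting is as precision and recall over the set of predicted segments. The sketch below is a generic illustration only: it is not the paper's Effective Temporal F1, whose definition this summary does not give. The names `temporal_iou` and `segment_f1`, the greedy one-to-one matching, and the 0.5 IoU threshold are all assumptions made for the example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(
    pred: List[Segment], gt: List[Segment], iou_threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) over segments.

    Assumption: a prediction counts as correct when it is matched
    one-to-one with a ground-truth segment at IoU >= iou_threshold.
    Matching is greedy, highest IoU first.
    """
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched only once
        used_pred.add(i)
        used_gt.add(j)

    true_pos = len(used_pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# One query, three ground-truth segments, but only two predicted.
gt_segments = [(5.0, 10.0), (30.0, 40.0), (62.0, 70.0)]
pred_segments = [(5.0, 10.0), (31.0, 40.0)]
precision, recall, f1 = segment_f1(pred_segments, gt_segments)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case both predictions are correct, so precision is 1.0, but one of the three true segments is missed, so recall falls to 2/3. A model that underestimates how many segments match a query is penalized in the same way.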
IMPACT Establishes a new benchmark and dataset for multi-segment video retrieval, pushing the capabilities of MLLMs in complex temporal grounding tasks.
Read on Hugging Face Daily Papers →
AI-generated summary · Google Gemini · from 3 sources.