Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [3 sources]

Towards One-to-Many Temporal Grounding

Researchers have introduced a new benchmark and dataset for One-to-Many Temporal Grounding (OMTG), a task that involves localizing multiple video segments corresponding to a single text query. Existing multimodal large language models (MLLMs) struggle with OMTG because they lack event cardinality perception, a sense of how many matching events a video contains. The proposed method uses new temporal and caption reward functions together with Chain-of-Thought reasoning to improve precision and completeness. In experiments it reaches an Effective Temporal F1 score of 43.65%, a new state of the art that outperforms models such as Gemini 2.5 Pro and Seed-1.8.
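To make the one-to-many setting concrete, here is a minimal Python sketch of a generic segment-level F1 for multi-segment grounding. It is not the paper's Effective Temporal F1, whose definition this brief does not give. The names `temporal_iou` and `segment_f1`, the greedy one-to-one matching, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment], thresh: float = 0.5) -> float:
    """Generic segment-level F1 (an illustrative stand-in, not the paper's metric)."""
    # Score every (prediction, ground truth) pair, highest IoU first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    for iou, i, j in pairs:
        if iou < thresh:
            break  # remaining pairs overlap too little to count as matches
        if i in used_pred or j in used_gold:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gold.add(j)
    tp = len(used_pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)  # penalizes spurious segments
    recall = tp / len(gold)     # penalizes missed segments
    return 2 * precision * recall / (precision + recall)


gold = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
print(segment_f1([(1.0, 10.0), (21.0, 29.0), (70.0, 80.0)], gold))
print(segment_f1([(0.0, 10.0)], gold))
```

In this sketch, a model that returns only one perfectly placed segment for a three-event query still scores low on recall, which is the kind of failure the brief attributes to missing cardinality perception.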

IMPACT Establishes a new benchmark and dataset for multi-segment video retrieval, pushing the capabilities of MLLMs in complex temporal grounding tasks.

Gemini 2.5 Pro
Seed-1.8