PulseAugur

Model trained with novel reward functions surpasses Gemini 2.5 Pro on new OMTG benchmark

Researchers have introduced a benchmark and dataset for One-to-Many Temporal Grounding (OMTG), the task of localizing multiple video segments that correspond to a single text query. Existing multimodal large language models (MLLMs) struggle with OMTG because they lack event cardinality perception. The proposed method adds novel temporal and caption reward functions and uses Chain-of-Thought reasoning to improve precision and completeness. In experiments, it reaches a new state-of-the-art Effective Temporal F1 score of 43.65%, significantly outperforming models such as Gemini 2.5 Pro and Seed-1.8.
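
To make the one-to-many setting concrete, here is a minimal sketch of a generic segment-level F1 score in Python. It is not the paper's Effective Temporal F1, whose exact definition is not given in this summary. The function names, the greedy one-to-one matching, and the 0.5 IoU threshold are all illustrative assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Generic segment-level F1 (an assumption, not the paper's metric).

    Each predicted segment may match at most one ground-truth segment,
    and vice versa. Pairs are matched greedily in order of decreasing IoU.
    """
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0

    # Score every (prediction, ground truth) pair, best overlap first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )

    used_pred, used_gt = set(), set()
    matches = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # all remaining pairs overlap too little
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches += 1

    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(gt)
    return 2 * precision * recall / (precision + recall)


# One query, three ground-truth segments in the same video.
ground_truth = [(0.0, 10.0), (30.0, 40.0), (60.0, 70.0)]
single_guess = [(0.0, 10.0)]               # a model that returns only one segment
two_guesses = [(1.0, 10.0), (31.0, 41.0)]  # finds two of the three events

score_single = segment_f1(single_guess, ground_truth)
score_two = segment_f1(two_guesses, ground_truth)
```

Under these assumptions, a model that returns a single perfectly localized segment still scores only 0.5, because its recall is capped at one third. This is one way to see why perceiving how many events match a query matters for OMTG.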

IMPACT Establishes a new benchmark and dataset for multi-segment video retrieval, pushing the capabilities of MLLMs in complex temporal grounding tasks.

RANK_REASON The cluster contains a research paper introducing a new benchmark, dataset, and model for a specific AI task.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Jason Li

    Towards One-to-Many Temporal Grounding

    Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term O…

  2. Hugging Face Daily Papers TIER_1 English(EN)

    Towards One-to-Many Temporal Grounding

    One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.

  3. arXiv cs.CV TIER_1 English(EN) · Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

    Towards One-to-Many Temporal Grounding

    arXiv:2606.06294v1. Abstract: Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint se…