New OMTG benchmark introduced; approach with novel reward functions surpasses Gemini 2.5 Pro

By PulseAugur Editorial · [3 sources] · 2026-06-04 00:00

Researchers have introduced a new benchmark and dataset for One-to-Many Temporal Grounding (OMTG), a task in which a single text query must be matched to multiple video segments. Existing multimodal large language models (MLLMs) struggle with OMTG because they lack event cardinality perception, that is, a sense of how many segments a query should return. The proposed solution introduces novel temporal and caption reward functions and uses Chain-of-Thought reasoning to improve precision and completeness. Experiments report a new state-of-the-art Effective Temporal F1 score of 43.65%, outperforming models such as Gemini 2.5 Pro and Seed-1.8.
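To make the one-to-many setting concrete, the sketch below scores a set of predicted segments against a set of ground-truth segments with a plain IoU-matched F1. This is an illustrative assumption, not the paper's Effective Temporal F1 or either of its reward functions, whose definitions are not given in this summary; the names `temporal_iou` and `segment_f1` and the 0.5 IoU threshold are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gt: List[Segment], iou_thresh: float = 0.5) -> float:
    """Illustrative F1 over segment sets (NOT the paper's metric).

    Predictions are matched one-to-one to ground-truth segments, greedily
    by descending IoU; a pair counts as a match only if IoU >= iou_thresh.
    """
    if not pred or not gt:
        return 1.0 if not pred and not gt else 0.0

    # All candidate (IoU, pred index, gt index) pairs, best overlap first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    matches = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches += 1

    if matches == 0:
        return 0.0
    precision = matches / len(pred)  # penalizes spurious segments
    recall = matches / len(gt)       # penalizes missed segments
    return 2 * precision * recall / (precision + recall)


# One query, three ground-truth occurrences of the event.
gt = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
two_found = segment_f1([(1.0, 10.0), (21.0, 29.0)], gt)
one_found = segment_f1([(0.0, 10.0)], gt)
```

Under this sketch, a model that returns only one perfectly localized segment still loses recall on the two it missed, which is the kind of cardinality failure the summary attributes to existing MLLMs.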

IMPACT Establishes a new benchmark and dataset for multi-segment video retrieval, pushing the capabilities of MLLMs in complex temporal grounding tasks.

RANK_REASON The cluster contains a research paper introducing a new benchmark, dataset, and model for a specific AI task.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.AI TIER_1 English(EN) · Jason Li · 2026-06-04 15:31

Towards One-to-Many Temporal Grounding

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term O…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Towards One-to-Many Temporal Grounding

One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.
arXiv cs.CV TIER_1 English(EN) · Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li · 2026-06-05 04:00

Towards One-to-Many Temporal Grounding

arXiv:2606.06294v1 Announce Type: new Abstract: Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint se…
