New benchmark CoMET-Bench tackles multi-event video grounding

By PulseAugur Editorial · [1 source] · 2026-06-16 04:00

Researchers have introduced CoMET-Bench, a benchmark for Conditional Multi-Event Temporal Grounding in long-form video. According to the authors, existing benchmarks fall short because they typically localize only a single event or treat grounding and counting as separate tasks. CoMET-Bench pairs a large dataset of complex queries with a unified evaluation protocol built around a new Rejection-F1 metric. The authors also propose an agentic framework, CoMET-Agent, which reformulates the task as structured search and aggregation and is reported to outperform GPT-5.
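To make concrete why grounding and counting are hard to score separately, here is a minimal sketch of set-level scoring for multi-event grounding. It is not the paper's Rejection-F1, whose definition is not given in this summary. The function names, the 0.5 IoU threshold, the greedy matching, and the convention that an empty prediction on an empty ground truth scores 1.0 are all assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec), hypothetical representation


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def multi_event_f1(pred: List[Span], gold: List[Span], thr: float = 0.5) -> float:
    """F1 over one-to-one matches between predicted and annotated events.

    Assumption: when no event satisfies the query and the model predicts
    nothing, the score is 1.0 (a correct rejection).
    """
    if not pred and not gold:
        return 1.0
    # All candidate pairs, best overlap first, for greedy one-to-one matching.
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < thr:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
pred = [(1.0, 10.0), (21.0, 29.0)]  # well localized, but one event is missed
score = multi_event_f1(pred, gold)
```

In this example both predicted spans are well localized, yet the missed third event lowers recall and therefore the score. A single-event metric would not register that kind of error.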


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, Taotao Jing, Shuai Zhang, Chen Chen · 2026-06-16 04:00

Conditional Multi-Event Temporal Grounding in Long-Form Video

arXiv:2606.15320v1. Abstract: Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing bench…
