Conditional Multi-Event Temporal Grounding in Long-Form Video
Researchers have introduced CoMET-Bench, a benchmark for Conditional Multi-Event Temporal Grounding in long-form video. The authors argue that existing benchmarks fall short because they typically localize only a single event or treat grounding and counting as separate tasks. CoMET-Bench pairs a large dataset of complex queries with a unified evaluation protocol, including a new Rejection-F1 metric, to address limitations in current methods. The authors also propose CoMET-Agent, an agentic framework that reformulates the task as structured search and aggregation and is reported to outperform GPT-5.
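To make the evaluation idea concrete, here is a minimal Python sketch of how multi-event grounding with rejection could be scored. The summary does not define Rejection-F1, so everything below is an assumption for illustration: the sketch treats "rejection" as correctly returning no segments for a query that has no matching event, and it matches predicted and annotated segments by temporal IoU. The function names, the 0.5 threshold, and the greedy matching are hypothetical, not the benchmark's protocol.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_events(pred: List[Segment], gold: List[Segment], thr: float = 0.5) -> int:
    """Greedy one-to-one matching; returns how many predictions hit a gold event.

    The threshold and the greedy strategy are illustrative assumptions.
    """
    used = set()
    hits = 0
    for p in pred:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gold):
            if j in used:
                continue
            iou = temporal_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits


def rejection_f1(preds: List[List[Segment]], golds: List[List[Segment]]) -> float:
    """F1 over queries, with 'should be rejected' (empty gold) as the positive class.

    This is an assumed reading of the metric name, not its published definition.
    """
    tp = sum(1 for p, g in zip(preds, golds) if not p and not g)
    fp = sum(1 for p, g in zip(preds, golds) if not p and g)
    fn = sum(1 for p, g in zip(preds, golds) if p and not g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a system that returns an empty list for every query would get perfect recall on the rejection class but low precision, so an F1-style score penalizes both over-rejection and hallucinated events.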