PulseAugur

New NEST dataset challenges AI models with narrative understanding in long videos

Researchers have introduced NEST, a dataset for evaluating how well long-video models understand narrative. NEST comprises 1005 full-length movies, each annotated with over 100 multimodal narrative events linked through temporal, hierarchical, and long-range dependencies. The dataset aims to move beyond simple retrieval tasks and assess whether models comprehend complex narrative structures, including cause-and-effect relationships across extended periods and reframed events. Initial baseline results show that models struggle with event detection and argument extraction, while event relation extraction shows more promise.

IMPACT Introduces a challenging new benchmark for long-form video understanding that tests narrative comprehension rather than simple retrieval.

RANK_REASON The cluster describes a new academic dataset and benchmark for evaluating AI models, presented in a research paper on arXiv.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas ·

    NEST: Narrative Event Structures in Time for Long Video Understanding

    arXiv:2606.19706v1 · Announce Type: cross · Abstract: Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos.…

  2. arXiv cs.CL TIER_1 English(EN) · Chris Thomas ·

    NEST: Narrative Event Structures in Time for Long Video Understanding

    Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in…