Researchers have introduced SceneBench, a new benchmark designed to evaluate how well video understanding models retain context over long videos, particularly across scene boundaries. Their findings indicate that current vision-language models (VLMs) exhibit significant forgetting when asked questions that require reasoning over extended temporal spans. To address this, they propose Scene-RAG, a retrieval-augmented generation method that improves VLM performance by 2.50% by dynamically integrating relevant context across scenes, highlighting the ongoing challenge of building robust long-context retention into VLMs.
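The core idea behind scene-level retrieval-augmented generation can be illustrated with a minimal sketch. Note this is a hypothetical illustration of the general technique, not the paper's actual Scene-RAG implementation: the toy bag-of-words similarity, the scene descriptions, and all function names here are assumptions; a real system would use learned video/text encoders.

```python
# Hypothetical sketch of scene-level RAG for long-video question answering.
# Each scene is represented by a text description; at question time, the most
# relevant scenes are retrieved and prepended to the VLM prompt as context.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_scenes(question: str, scene_descriptions: list[str], k: int = 2) -> list[str]:
    # Rank scenes by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(scene_descriptions, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

scenes = [
    "Scene 1: a chef chops vegetables in a kitchen",
    "Scene 2: a car chase through city streets",
    "Scene 3: the chef plates the finished dish",
]
question = "what did the chef cook"
context = retrieve_scenes(question, scenes)
prompt = "Context:\n" + "\n".join(context) + "\nQuestion: " + question
```

The point of the retrieval step is that only question-relevant scenes enter the prompt, so the model need not hold the entire video in context, which is exactly the long-context forgetting the benchmark probes.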
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights limitations in current VLMs for long-context video understanding, potentially guiding future research towards more robust temporal reasoning capabilities.
RANK_REASON This is a research paper introducing a new benchmark and method for evaluating video understanding models.