Egocentric Scene Graphs Enable MLLMs to Reason Over Long Videos

By PulseAugur Editorial · [2 sources] · 2026-06-24 13:55

Researchers have developed a framework that lets multi-modal large language models (MLLMs) reason over long-form egocentric videos despite strict input token limits. The approach uses Egocentric Scene Graphs (EgoSGs): temporally grounded, structured representations of objects, attributes, spatial relations, and interactions. Converting a video into these compact, symbolic graphs significantly shortens the input while preserving essential semantic and temporal information, so an MLLM can process an entire video sequence within its context window. The method achieves state-of-the-art results on the HD-EPIC VQA benchmark, outperforming existing video-based baselines.
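To make the idea of a temporally grounded scene graph concrete, here is a minimal Python sketch of one way such a graph could be stored and serialized into compact text for a language model prompt. The class names (`Relation`, `EgoSceneGraph`), the fields, and the text format are all assumptions made for illustration; the summary does not describe the paper's actual schema or serialization.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One timestamped edge: subject --predicate--> object (hypothetical format)."""
    start_s: float
    end_s: float
    subject: str
    predicate: str
    obj: str


@dataclass
class EgoSceneGraph:
    """Illustrative container only; not the schema used in the paper."""
    attributes: dict[str, list[str]] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_object(self, name: str, *attrs: str) -> None:
        # Register an object node and append any attributes to it.
        self.attributes.setdefault(name, []).extend(attrs)

    def add_relation(self, start_s, end_s, subject, predicate, obj) -> None:
        # Record a spatial relation or interaction with its time span.
        self.relations.append(Relation(start_s, end_s, subject, predicate, obj))

    def to_prompt(self) -> str:
        # Emit objects alphabetically, then relations in temporal order,
        # so the text keeps the ordering of events in the video.
        lines = []
        for name in sorted(self.attributes):
            attrs = ", ".join(self.attributes[name]) or "-"
            lines.append(f"obj {name}: {attrs}")
        for r in sorted(self.relations, key=lambda r: (r.start_s, r.end_s)):
            lines.append(
                f"[{r.start_s:.1f}-{r.end_s:.1f}s] {r.subject} {r.predicate} {r.obj}"
            )
        return "\n".join(lines)


graph = EgoSceneGraph()
graph.add_object("knife", "metal")
graph.add_object("onion", "peeled")
graph.add_relation(12.0, 15.5, "camera_wearer", "cuts", "onion")
graph.add_relation(3.0, 4.0, "knife", "on", "counter")

prompt_text = graph.to_prompt()
print(prompt_text)
```

In this sketch a multi-second interaction collapses to a single short line of text, which illustrates why a symbolic representation can be far shorter than the frames it describes.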

IMPACT Enables MLLMs to process and reason over extended video content, potentially improving applications in video analysis and understanding.

RANK_REASON Academic paper detailing a new method for video understanding with LLMs.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CV TIER_1 English(EN) · Agnese Taluzzi, Riccardo Santambrogio, Simone Mentasti, Chiara Plizzari, Matteo Matteucci · 2026-06-25 04:00

Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs

arXiv:2606.25842v1 Announce Type: new Abstract: Existing multi-modal large language models (MLLMs) face significant challenges in processing long video sequences due to strict input token limitations. As a result, current video understanding approaches, especially in egocentric s…

arXiv cs.CV TIER_1 English(EN) · Matteo Matteucci · 2026-06-24 13:55

Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs

Existing multi-modal large language models (MLLMs) face significant challenges in processing long video sequences due to strict input token limitations. As a result, current video understanding approaches, especially in egocentric settings characterized by complex dynamics, frequ…
