Researchers have developed a framework that lets multi-modal large language models (MLLMs) reason over long-form egocentric videos without exceeding their context-window token limits. The approach uses Egocentric Scene Graphs (EgoSGs): temporally grounded, structured representations of objects, attributes, spatial relations, and interactions. Converting a video into these compact, symbolic scene graphs significantly reduces input length while preserving essential semantic and temporal information, so an MLLM can process an entire video sequence within its context window. The technique achieves state-of-the-art results on the HD-EPIC VQA benchmark, outperforming existing video-based baselines.
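To make the idea of a temporally grounded scene graph concrete, here is a minimal Python sketch of how objects, attributes, and time-stamped relations might be stored and flattened into compact text for a language model. This is an illustration only: the `SceneGraph` and `Relation` classes, the field names, the text layout, and the kitchen example are all assumptions, not the schema or serialization used in the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One edge (subject, predicate, object) valid over [start, end] seconds."""
    subject: str
    predicate: str
    obj: str
    start: float
    end: float


@dataclass
class SceneGraph:
    """Hypothetical container; the paper's actual EgoSG schema is not given here."""
    attributes: dict = field(default_factory=dict)   # object name -> list of attributes
    relations: list = field(default_factory=list)    # list of Relation

    def add_object(self, name, *attrs):
        self.attributes.setdefault(name, []).extend(attrs)

    def add_relation(self, subject, predicate, obj, start, end):
        self.relations.append(Relation(subject, predicate, obj, start, end))

    def to_text(self):
        """Flatten the graph to one short line per object and per relation,
        with relations ordered by start time so temporal order is preserved."""
        lines = []
        for name in sorted(self.attributes):
            attrs = ",".join(self.attributes[name])
            lines.append(f"obj {name} [{attrs}]")
        for r in sorted(self.relations, key=lambda r: (r.start, r.end)):
            lines.append(f"{r.start:.1f}-{r.end:.1f} {r.subject} {r.predicate} {r.obj}")
        return "\n".join(lines)


def build_example():
    """Invented kitchen clip, used only to exercise the sketch."""
    g = SceneGraph()
    g.add_object("knife", "metal")
    g.add_object("onion", "peeled")
    g.add_relation("camera_wearer", "picks_up", "knife", 12.0, 13.5)
    g.add_relation("knife", "on", "counter", 0.0, 12.0)
    g.add_relation("camera_wearer", "cuts", "onion", 14.0, 31.0)
    return g


if __name__ == "__main__":
    print(build_example().to_text())
```

The point of the sketch is that a few short text lines can stand in for many seconds of footage, which is the kind of input-length reduction the summary describes; how the graphs are actually extracted from video is not covered here.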
IMPACT Enables MLLMs to process and reason over extended video content, potentially improving applications in video analysis and understanding.
RANK_REASON Academic paper detailing a new method for video understanding with LLMs.
AI-generated summary · Google Gemini · from 2 sources.