SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
Researchers have developed SceneGraphVLM, a method for generating dynamic scene graphs from video with compact vision-language models. The approach serializes graphs into the efficient TOON format and trains in two stages, including reinforcement learning with specialized rewards that improve precision and reduce irrelevant objects. SceneGraphVLM offers a strong quality-speed trade-off: it reaches near real-time performance with vLLM acceleration and provides lightweight temporal context for video analysis.
AI IMPACT: Introduces a more efficient method for structured visual perception from video, potentially improving downstream AI tasks that rely on understanding scene context.
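To make the serialization step concrete, here is a minimal sketch of writing one frame's scene graph as a TOON-style tabular block. The summary does not give the schema SceneGraphVLM actually uses, so the field names (`frame`, `objects`, `relations`, `subject`, `predicate`, `object`) and the helper `to_toon` are hypothetical, chosen only for illustration. The sketch also assumes labels contain no commas, colons, or quotes, so it skips the quoting rules a full TOON encoder would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    id: int
    label: str


@dataclass(frozen=True)
class Relation:
    subject: int    # id of the subject object
    predicate: str
    object: int     # id of the target object


def to_toon(frame: int, objects: list[SceneObject], relations: list[Relation]) -> str:
    """Serialize one frame's scene graph as a TOON-style tabular block.

    Hypothetical schema: each array header declares its length and field
    names once, and every following indented row lists only the values.
    """
    lines = [f"frame: {frame}"]
    lines.append(f"objects[{len(objects)}]{{id,label}}:")
    lines += [f"  {o.id},{o.label}" for o in objects]
    lines.append(f"relations[{len(relations)}]{{subject,predicate,object}}:")
    lines += [f"  {r.subject},{r.predicate},{r.object}" for r in relations]
    return "\n".join(lines)


# Example: a person holding a cup in frame 0.
example_objects = [SceneObject(1, "person"), SceneObject(2, "cup")]
example_relations = [Relation(1, "holding", 2)]
print(to_toon(0, example_objects, example_relations))
```

Running the example prints:

```
frame: 0
objects[2]{id,label}:
  1,person
  2,cup
relations[1]{subject,predicate,object}:
  1,holding,2
```

Because the field names appear once per array rather than once per row, longer graphs spend fewer characters on repeated keys than an equivalent JSON list of objects would, which is the general motivation for a format of this kind.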