Brief · PulseAugur

RESEARCH · arXiv cs.MA (Multiagent) English(EN) · 1w · [8 sources]

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, designed to more rigorously evaluate the spatio-temporal reasoning capabilities of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs). VGenST-Bench utilizes active video synthesis to create controlled scenarios across various spatial and temporal dimensions, enabling fine-grained diagnosis of MLLM understanding. CaST-Bench focuses on causal chain-grounded spatio-temporal reasoning, requiring models to identify and localize evidence for cause-and-effect relationships in videos, highlighting current VLM limitations in this area. AI

IMPACT These benchmarks aim to improve the evaluation of AI models' understanding of real-world scenarios, pushing for more robust spatio-temporal and causal reasoning.

Multimodal Large Language Models
VGenST-Bench
MLLMs
Vision-Language Models
CaST-Bench