New benchmarks VGenST-Bench and CaST-Bench target MLLM spatio-temporal reasoning

By PulseAugur Editorial · [6 sources] · 2026-05-16 16:15

Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, designed to evaluate more rigorously the spatio-temporal reasoning of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs). VGenST-Bench uses active video synthesis to create controlled scenarios across spatial and temporal dimensions, enabling fine-grained diagnosis of MLLM understanding. CaST-Bench focuses on causal chain-grounded spatio-temporal reasoning: models must identify and localize the evidence for cause-and-effect relationships in videos, a task that exposes the limitations of current VLMs.

IMPACT These benchmarks aim to improve the evaluation of AI models' understanding of real-world scenarios, pushing for more robust spatio-temporal and causal reasoning.

RANK_REASON The cluster describes the release of two new academic benchmarks for evaluating AI models.


AI-generated summary · Google Gemini · from 6 sources.

COVERAGE [6]

arXiv cs.AI TIER_1 English(EN) · Eunbyung Park · 2026-05-21 14:48

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static ima…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yunpu Ma · 2026-05-16 16:15

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in …

arXiv cs.CV TIER_1 English(EN) · Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong · 2026-05-25 04:00

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

arXiv:2605.23216v1. Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rare…

arXiv cs.CV TIER_1 English(EN) · Quan Kong · 2026-05-22 04:19

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence n…

arXiv cs.CV TIER_1 English(EN) · Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park · 2026-05-22 04:00

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

arXiv:2605.22570v1. Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning…
