PulseAugur

New benchmarks VGenST-Bench and CaST-Bench target MLLM spatio-temporal reasoning

Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, to evaluate more rigorously the spatio-temporal reasoning of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs). VGenST-Bench uses active video synthesis to create controlled scenarios across spatial and temporal dimensions, enabling fine-grained diagnosis of MLLM understanding. CaST-Bench targets causal chain-grounded spatio-temporal reasoning: models must identify and localize the video evidence for cause-and-effect relationships, a task that exposes current VLM limitations.

IMPACT These benchmarks aim to improve the evaluation of AI models' understanding of real-world scenarios, pushing for more robust spatio-temporal and causal reasoning.

RANK_REASON The cluster describes the release of two new academic benchmarks for evaluating AI models.


AI-generated summary · Google Gemini · from 6 sources.

COVERAGE [6]

  1. arXiv cs.AI TIER_1 English(EN) · Eunbyung Park

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static ima…

  2. Hugging Face Daily Papers TIER_1 English(EN)

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

  3. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yunpu Ma

    PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

    Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in …

  4. arXiv cs.CV TIER_1 English(EN) · Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

    CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

    arXiv:2605.23216v1. Abstract: Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rare…

  5. arXiv cs.CV TIER_1 English(EN) · Quan Kong

    CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

    Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence n…

  6. arXiv cs.CV TIER_1 English(EN) · Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    arXiv:2605.22570v1. Abstract: Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning…