New benchmarks VGenST-Bench and CaST-Bench target the spatio-temporal reasoning of MLLMs

By PulseAugur Editorial · [6 sources] · 2026-05-16 16:15

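Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, designed to evaluate the spatio-temporal reasoning of multimodal large language models (MLLMs) and vision-language models (VLMs) more rigorously. VGenST-Bench uses active video synthesis to create controlled scenarios across a range of spatial and temporal dimensions, enabling fine-grained diagnosis of MLLM understanding. CaST-Bench focuses on causal chain-grounded spatio-temporal reasoning: it requires models to identify and localize the evidence for causal relationships in video, highlighting the limitations of current VLMs in this area.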

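Impact: These benchmarks aim to improve how AI models' understanding of real-world scenes is evaluated, pushing toward stronger spatio-temporal and causal reasoning.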

Ranking rationale: This cluster describes the release of two new academic benchmarks for evaluating AI models.

AI-generated summary · Google Gemini · from 6 sources.

Coverage sources [6]

arXiv cs.AI TIER_1 English(EN) · Eunbyung Park · 2026-05-21 14:48

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static ima…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yunpu Ma · 2026-05-16 16:15

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in …

arXiv cs.CV TIER_1 English(EN) · Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong · 2026-05-25 04:00

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

arXiv:2605.23216v1. Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rare…

arXiv cs.CV TIER_1 English(EN) · Quan Kong · 2026-05-22 04:19

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence n…

arXiv cs.CV TIER_1 English(EN) · Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park · 2026-05-22 04:00

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

arXiv:2605.22570v1. Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning…
