PulseAugur

New benchmarks VGenST-Bench and CaST-Bench target spatio-temporal reasoning in MLLMs

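Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, designed to evaluate the spatio-temporal reasoning of multimodal large language models (MLLMs) and vision-language models (VLMs) more rigorously. VGenST-Bench uses active video synthesis to create controlled scenarios across a range of spatial and temporal dimensions, which allows fine-grained diagnosis of what MLLMs understand. CaST-Bench focuses on causal chain-grounded spatio-temporal reasoning: it requires models to identify and localize the evidence for causal relations in a video, and it highlights the limitations of current VLMs in this area.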

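Impact: These benchmarks aim to improve how AI models' understanding of real-world scenes is evaluated, and to push toward stronger spatio-temporal and causal reasoning.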

Ranking rationale: This cluster describes the release of two new academic benchmarks for evaluating AI models.


AI-generated summary · Google Gemini · from 6 sources.

Sources [6]

  1. arXiv cs.AI TIER_1 English(EN) · Eunbyung Park ·

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static ima…

  2. Hugging Face Daily Papers TIER_1 English(EN) ·

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

  3. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yunpu Ma ·

    PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

    Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in …

  4. arXiv cs.CV TIER_1 English(EN) · Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong ·

    CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

    arXiv:2605.23216v1 · Abstract: Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rare…

  5. arXiv cs.CV TIER_1 English(EN) · Quan Kong ·

    CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

    Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence n…

  6. arXiv cs.CV TIER_1 English(EN) · Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park ·

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    arXiv:2605.22570v1 · Abstract: Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning…