New benchmark MindEdit-Bench reveals VLM struggles with counterfactual spatial reasoning

By PulseAugur Editorial · [2 sources] · 2026-07-01 06:19

Researchers have introduced MindEdit-Bench, a benchmark for evaluating object-level counterfactual spatial reasoning in vision-language models (VLMs). It is built from triplets of smartphone photos of everyday indoor scenes, processed by an automated pipeline for 3D scene-graph extraction. Its tasks probe perception and perspective transformation, and add two novel task types, spatial editing and cross-view visibility editing, whose correct answers are not present in the input images. Initial testing across 15 VLMs found accuracy significantly below human performance, indicating a substantial gap in counterfactual spatial reasoning.

IMPACT Highlights a critical gap in VLM capabilities, potentially guiding future research towards more robust spatial understanding.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Leyuan Yu, Xiao Tang, Minghao Liu, Xinyuan Li, Xiaokai Bai, Sheng Zhou, Qunshu Lin, Weihao Xuan, Naoto Yokoya · 2026-07-02 04:00

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed. Ca…
arXiv cs.AI TIER_1 English(EN) · Naoto Yokoya · 2026-07-01 06:19

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed. Can VLMs instead predict the consequences of hypothe…
