New CoSPlan benchmark challenges vision-language models in visual planning tasks

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have introduced CoSPlan, a new benchmark for evaluating the sequential planning capabilities of vision-language models (VLMs) in visual domains. Unlike text-based planning, CoSPlan requires models to execute a series of visual actions, detect erroneous steps, and correct them to reach a target scene. Even when paired with advanced strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan. To address this, the paper proposes Scene Graph Incremental updates (SGI), a training-free method that refines textual scene graphs for step-by-step reasoning. SGI shows an average improvement of 4.4% on CoSPlan and generalizes to PlanBench and VQA.
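To make the idea of incrementally updating a textual scene graph concrete, here is a minimal sketch. It is not the paper's implementation: the triple representation, the action format, the set-difference distance, and every function name are assumptions made for illustration only. The scene is a set of (subject, relation, object) triples, each action removes some triples and adds others, and a step is flagged as erroneous if it does not bring the scene closer to the target.

```python
from typing import FrozenSet, Optional, Sequence, Tuple

# Assumed representation: a scene graph is a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]
# Assumed (hypothetical) action format: (triples to remove, triples to add).
Action = Tuple[Sequence[Triple], Sequence[Triple]]


def apply_action(graph: SceneGraph, action: Action) -> SceneGraph:
    """Incrementally update the graph instead of rebuilding it from scratch."""
    removed, added = action
    return (graph - frozenset(removed)) | frozenset(added)


def distance(graph: SceneGraph, goal: SceneGraph) -> int:
    """Illustrative heuristic: number of triples on which graph and goal disagree."""
    return len(graph ^ goal)


def first_erroneous_step(
    start: SceneGraph, actions: Sequence[Action], goal: SceneGraph
) -> Optional[int]:
    """Return the index of the first action that fails to move toward the goal."""
    graph = start
    for i, action in enumerate(actions):
        updated = apply_action(graph, action)
        if distance(updated, goal) >= distance(graph, goal):
            return i
        graph = updated
    return None


# Toy scene: two blocks on a table; the target has the red block on the blue one.
start = frozenset({("red", "on", "table"), ("blue", "on", "table")})
goal = frozenset({("red", "on", "blue"), ("blue", "on", "table")})
good = ([("red", "on", "table")], [("red", "on", "blue")])
bad = ([("blue", "on", "table")], [("blue", "on", "red")])
```

In this toy setup, `first_erroneous_step(start, [bad, good], goal)` flags the first action, while a plan containing only `good` passes without a flag. A real system would derive the graph and the actions from images and model outputs, which this sketch does not attempt.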

IMPACT Introduces a new benchmark to push the capabilities of vision-language models in complex visual planning tasks.

RANK_REASON The cluster contains a research paper detailing a new benchmark and method for evaluating AI models.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Shresth Grover, Priyank Pathak, Akash Kumar, Yogesh S Rawat · 2026-06-30 04:00

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

arXiv:2512.10342v3 · Abstract: Vision Language Models (VLMs) have shown promising planning capabilities, yet their success remains confined to the text domain, leaving visual decision-making relatively underexplored. Addressing this gap, we introduce Correcti…
