Researchers have introduced CoSPlan, a new benchmark designed to evaluate the sequential planning capabilities of vision-language models (VLMs) in visual domains. Unlike text-based planning, CoSPlan requires models to execute a series of visual actions, detect erroneous steps, and correct them to reach a target scene. Even with advanced strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan. To address this, the paper proposes Scene Graph Incremental updates (SGI), a training-free method that refines textual scene graphs for step-by-step reasoning, yielding an average improvement of 4.4% on CoSPlan and generalizing to PlanBench and VQA.
AI IMPACT Introduces a new benchmark to push the capabilities of vision-language models in complex visual planning tasks.
RANK_REASON The cluster contains a research paper detailing a new benchmark and method for evaluating AI models.
- arXiv
- CoSPlan
- PlanBench
- Priyank Pathak
- Scene Graph Incremental updates
- Scene Graphs
- vision-language model
- visual question answering
AI-generated summary · Google Gemini · from 1 source.