Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

New Gen-VCoT framework generates visual reasoning steps for multimodal AI

By the PulseAugur editorial team · [2 sources] · 2026-06-15 14:27

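Researchers have introduced Gen-VCoT, a new framework that aims to enhance multimodal large language models (MLLMs) by generating visual chain-of-thought (CoT) reasoning steps. Unlike existing methods that rely on text-based CoT or opaque tokens, Gen-VCoT uses expert vision models to generate interpretable RGB images as intermediate reasoning representations. The approach comprises visual grounding with SAM, geometric reasoning with Marigold depth maps, and semantic reasoning integrated with Qwen2-VL, with an adaptive router controlling the reasoning depth. Although Gen-VCoT shows significant improvements on spatial and depth-related questions, its performance on simple factual queries may suffer, and text-based CoT remains superior on some tasks such as CLEVR.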
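To make the idea of an adaptive router more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the summary does not say how the router decides, so the keyword rules, the names `route`, `run_plan`, and `ReasoningPlan`, and the hint lists are all hypothetical. The expert functions are stubs that only stand in for SAM, Marigold, and Qwen2-VL and do not call those libraries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical keyword lists (an assumption for illustration only).
SPATIAL_HINTS = ("left of", "right of", "behind", "in front of", "where", "next to")
DEPTH_HINTS = ("closer", "closest", "farther", "farthest", "nearer", "depth", "distance")


@dataclass
class ReasoningPlan:
    """Ordered list of reasoning stages chosen for one question."""
    steps: List[str] = field(default_factory=list)


def route(question: str) -> ReasoningPlan:
    """Pick reasoning stages for a question using toy keyword rules."""
    q = question.lower()
    plan = ReasoningPlan()
    if any(hint in q for hint in SPATIAL_HINTS):
        plan.steps.append("grounding")  # stage that would use a segmentation expert
    if any(hint in q for hint in DEPTH_HINTS):
        plan.steps.append("depth")      # stage that would use a depth expert
    plan.steps.append("semantic")       # final stage always runs
    return plan


# Stub experts: each returns a label where a real system would return an RGB image
# or an answer. These are placeholders, not calls into SAM, Marigold, or Qwen2-VL.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "grounding": lambda image: f"mask_overlay({image})",
    "depth": lambda image: f"depth_map({image})",
    "semantic": lambda image: f"answer_from({image})",
}


def run_plan(plan: ReasoningPlan, image: str) -> List[str]:
    """Run each chosen stage in order and collect its (stub) output."""
    return [EXPERTS[step](image) for step in plan.steps]


if __name__ == "__main__":
    for question in (
        "What color is the car?",
        "Which object is closer to the camera?",
        "Is the mug left of the laptop, and which one is farther away?",
    ):
        plan = route(question)
        print(question, "->", plan.steps, run_plan(plan, "img.png"))
```

In this toy version a simple factual question skips the visual stages entirely, which mirrors the summary's point that the router controls how deep the reasoning goes. A real router would presumably be learned rather than rule-based.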

Impact: Establishes a new paradigm for interpretable multimodal reasoning by generating visual intermediate representations.

Ranking rationale: This cluster describes a new research paper detailing a new framework for AI reasoning.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Zhiqiang Zhou, Junliang Dai, Xu ling · 2026-06-16 04:00

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

arXiv:2606.16783v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key proper…

arXiv cs.CV TIER_1 English(EN) · Xu ling · 2026-06-15 14:27

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using exper…
