ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

ReasonCLIP-58M strengthens CLIP models with visual commonsense reasoning

By PulseAugur Editorial · [2 sources] · 2026-06-25 09:27

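Researchers have introduced ReasonCLIP-58M, a new framework for continual pretraining of CLIP-style models. The method integrates large-scale reasoning supervision to strengthen visually grounded commonsense inference and compositional reasoning. The framework uses a two-stage strategy that preserves descriptive alignment while progressively adding reasoning signals, and it is supported by a new dataset and a diagnostic evaluation benchmark. ReasonCLIP-58M can serve as a plug-and-play visual encoder for multimodal large language models, improving performance without increasing inference cost.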
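As a rough illustration of a two-stage schedule that keeps descriptive alignment while gradually adding a reasoning signal, the sketch below combines two standard CLIP-style symmetric contrastive losses with a ramped weight. It is not the paper's implementation: the function names, the linear ramp, the `max_weight` of 0.5, and the step counts are all assumptions made for illustration, since the summary gives no such details.

```python
import math


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired, L2-normalised embeddings.

    Row i of image_embs is assumed to match row i of text_embs.
    """
    # Similarity logits: dot product scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]

    def mean_cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    image_to_text = mean_cross_entropy_on_diagonal(logits)
    text_to_image = mean_cross_entropy_on_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


def reasoning_weight(step, stage1_steps, ramp_steps, max_weight=0.5):
    """Hypothetical schedule: zero in stage 1, then a linear ramp in stage 2."""
    if step < stage1_steps:
        return 0.0
    progress = min(1.0, (step - stage1_steps) / ramp_steps)
    return max_weight * progress


def combined_loss(image_embs, caption_embs, reasoning_embs, step,
                  stage1_steps=1000, ramp_steps=1000):
    """Blend descriptive and reasoning losses (assumed blending rule)."""
    w = reasoning_weight(step, stage1_steps, ramp_steps)
    descriptive = clip_contrastive_loss(image_embs, caption_embs)
    if w == 0.0:
        return descriptive  # stage 1: descriptive alignment only
    reasoning = clip_contrastive_loss(image_embs, reasoning_embs)
    return (1.0 - w) * descriptive + w * reasoning
```

Because the descriptive term never drops below a weight of `1 - max_weight` in this sketch, the caption-alignment objective stays active throughout, which is one simple way to read "preserving descriptive alignment".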

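Impact: Strengthens visual reasoning in multimodal models, which may improve performance in applications that require deeper image understanding.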

Ranking rationale: This cluster contains a paper detailing a new method for pretraining vision models.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Sicheng Zhang, Muzammal Naseer, Binzhu Xie, Naufal Suryanto, Shi Qiu, Jamal Bentahar, Naveed Akhtar, Mubarak Shah · 2026-06-26 04:00

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

arXiv:2606.26794v1 Announce Type: cross Abstract: CLIP and its variants are widely adopted visual backbones in multimodal systems, but their pretraining remains dominated by descriptive image-text alignment. As downstream applications increasingly demand visually grounded commons…
arXiv cs.CV TIER_1 English(EN) · Mubarak Shah · 2026-06-25 09:27

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

CLIP and its variants are widely adopted visual backbones in multimodal systems, but their pretraining remains dominated by descriptive image-text alignment. As downstream applications increasingly demand visually grounded commonsense inference and compositional reasoning, it rem…
