Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

New MACCO framework strengthens compositionality in vision-language models

By PulseAugur Editorial · [2 sources] · 2026-06-11 12:45

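Researchers have developed MACCO, a new framework aimed at improving the compositional understanding of vision-language models (VLMs). MACCO masks compositional concepts in one modality and reconstructs them using contextual information from the other modality, addressing limitations that existing models often show with object relations, attribute-object binding, and word order. The approach strengthens the alignment of compositional structure across modalities and, on multiple benchmarks, significantly improves compositionality, the capture of syntactic structure, and the handling of linguistic information. The framework also benefits downstream applications such as text-to-image generation and multimodal large language models.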
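The mask-and-reconstruct idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`mask_concept`, `cross_modal_context`, `reconstruct`), the hand-written 3-dimensional embeddings, and the fixed query vector are all hypothetical and chosen only to show one concept word being masked in a caption and recovered from image-region features.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mask_concept(tokens, concept_index, mask_token="[MASK]"):
    """Hide one compositional concept (here, an attribute word) in the caption."""
    masked = list(tokens)
    target = masked[concept_index]
    masked[concept_index] = mask_token
    return masked, target

def cross_modal_context(query, region_features):
    """Attend from the masked text slot over image-region features."""
    weights = softmax([dot(query, region) for region in region_features])
    dim = len(region_features[0])
    return [sum(w * region[d] for w, region in zip(weights, region_features))
            for d in range(dim)]

def reconstruct(context, vocab_embeddings):
    """Choose the vocabulary word whose embedding best matches the visual context."""
    return max(vocab_embeddings, key=lambda word: dot(context, vocab_embeddings[word]))

# Toy, hand-written embeddings (assumption: dims loosely mean red, blue, cube-shape).
vocab = {"red": [1.0, 0.0, 0.0], "blue": [0.0, 1.0, 0.0], "cube": [0.0, 0.0, 1.0]}
regions = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]  # one reddish region, one cube-like region

masked, target = mask_concept(["a", "red", "cube"], concept_index=1)
query = [1.0, 1.0, 0.0]  # hypothetical query for the masked attribute slot
context = cross_modal_context(query, regions)
prediction = reconstruct(context, vocab)
print(masked, "->", prediction)
```

In a trained model the query, the region features, and the vocabulary embeddings would all be learned, and the reconstruction error would serve as a training signal; the sketch only fixes them by hand to make the data flow visible.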

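Impact: Strengthens the ability of vision-language models to understand complex relations and structure, which may improve multimodal AI applications.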

Ranking rationale: This is a research paper detailing a new framework for improving vision-language models.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Wei Li, Zhen Huang, Xinmei Tian · 2026-06-12 04:00

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

arXiv:2606.13288v1 Announce Type: cross Abstract: Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" beha…
arXiv cs.AI TIER_1 English(EN) · Xinmei Tian · 2026-06-11 12:45

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior--struggling to capture the object relations, …

