COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

New COMET framework analyzes the modality gap in audio-text AI

By PulseAugur Editorial · [1 source] · 2026-05-29 04:00

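Researchers have introduced COMET, a new framework for analyzing the modality gap in audio-text contrastive learning models such as CLAP. Using PLS-SVD, COMET shows that only a small subset of axes, which represent shared concepts, contributes significantly to the similarity computation, and that the mean component is only a partial indicator of the modality gap. The framework supports a training-free spectral truncation method that substantially reduces embedding dimensionality while maintaining strong performance on tasks such as audio captioning and retrieval, approaching fully supervised results in zero-shot settings.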
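
To make the idea concrete, here is a minimal sketch of generic PLS-SVD followed by spectral truncation, written with NumPy. It is not the paper's implementation: the summary does not give the exact procedure, so the function names `pls_svd` and `truncate`, the synthetic "audio" and "text" embeddings, and the 99% energy threshold are all illustrative assumptions. PLS-SVD here means taking the SVD of the cross-covariance between the two centered embedding sets, which yields paired axes ordered by how much covariance the two modalities share along them.

```python
import numpy as np


def pls_svd(audio, text):
    """Generic PLS-SVD: SVD of the cross-covariance of paired embeddings.

    Returns audio axes, singular values, text axes, and both means.
    """
    audio_mean = audio.mean(axis=0)
    text_mean = text.mean(axis=0)
    a = audio - audio_mean  # centering removes the mean component
    t = text - text_mean
    cross_cov = a.T @ t / (len(audio) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u, s, vt.T, audio_mean, text_mean


def truncate(emb, mean, axes, k):
    """Project centered embeddings onto the k leading shared axes."""
    return (emb - mean) @ axes[:, :k]


# Synthetic stand-in for paired embeddings (assumption, not CLAP output):
# a few shared "concepts" mixed into each modality, plus noise and a
# constant per-modality offset that plays the role of a mean shift.
rng = np.random.default_rng(0)
n, d, shared = 500, 64, 4
concepts = rng.normal(size=(n, shared))
audio = concepts @ rng.normal(size=(shared, d)) + 0.1 * rng.normal(size=(n, d)) + 1.0
text = concepts @ rng.normal(size=(shared, d)) + 0.1 * rng.normal(size=(n, d)) - 1.0

u, s, v, audio_mean, text_mean = pls_svd(audio, text)

# Keep the smallest number of axes holding 99% of the squared singular
# values (the threshold is an arbitrary choice for this sketch).
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.99) + 1)

audio_k = truncate(audio, audio_mean, u, k)
text_k = truncate(text, text_mean, v, k)
print(f"kept {k} of {d} axes")
```

No training is involved: the axes come from a single SVD over paired embeddings, and truncation is a matrix product. In this toy setup the covariance between the two modalities along each retained axis equals the corresponding singular value, which is why sorting by singular value ranks axes by shared content.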

Ranking rationale: This cluster contains a research paper detailing a new framework for analyzing multimodal embeddings.

CLAP
COMET

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang · 2026-05-29 04:00

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

arXiv:2605.29628v1 Announce Type: cross Abstract: Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the mo…
