MoVA Framework Improves Video-Text Alignment with Dual Asymmetric Projections

By PulseAugur Editorial Team · [2 sources] · 2026-07-01 12:23

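Researchers have introduced MoVA, a new framework that aims to improve video-text alignment by addressing temporal misalignment and semantic asymmetry. MoVA learns dual asymmetric projections that let it adaptively select the relevant parts of a caption and separate text-relevant visual concepts from video frames. This approach lets the model preserve global cross-modal semantics while handling evolving, frame-specific concepts and scaling to long videos and captions, and it outperforms existing methods on alignment tasks.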
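To make the idea of two modality-specific projections more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function `align_score`, the matrices `W_video` and `W_text`, the feature sizes, and the softmax-based caption-token weighting are all assumptions chosen for illustration, and the paper's actual notion of asymmetry may differ. The sketch only shows how a video-side projection and a separate text-side projection can map inputs of different sizes into one shared space, with caption tokens softly weighted by their relevance to the video.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def align_score(frames, caption_tokens, W_video, W_text):
    """Hypothetical alignment score using two separate projections.

    W_video and W_text are illustrative stand-ins for learned weights.
    """
    # Video side: project every frame, then pool into one video vector.
    frame_proj = [normalize(matvec(W_video, f)) for f in frames]
    video_vec = normalize(mean_pool(frame_proj))

    # Text side: a different projection, then soft selection of the
    # caption tokens that are most similar to the video vector.
    token_proj = [normalize(matvec(W_text, t)) for t in caption_tokens]
    weights = softmax([dot(t, video_vec) for t in token_proj])
    text_vec = normalize(weighted_sum(weights, token_proj))

    # Cosine similarity between the two pooled representations.
    return dot(video_vec, text_vec), weights


# Toy data: 5 frames with 6 features, 4 caption tokens with 4 features,
# both projected into a shared 3-dimensional space (all sizes assumed).
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(6)] for _ in range(5)]
caption_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]
W_video = [[random.gauss(0, 1) for _ in range(6)] for _ in range(3)]
W_text = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

score, weights = align_score(frames, caption_tokens, W_video, W_text)
```

In a trained model the two matrices would be learned, for example with a contrastive objective; here they are random, so the resulting score carries no meaning beyond showing the data flow.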

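Impact: This research could lead to more sophisticated AI systems that understand and generate content linking video and text more effectively.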

Ranking rationale: This is an academic paper detailing a new model for video-text alignment.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Peiyuan Zhu, Shaoan Xie, Zijian Li, Yifan Shen, Namrata Deka, Harsh Shrivastava, Guangyi Chen, Kun Zhang · 2026-07-02 04:00

MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

arXiv:2607.00858v1 Announce Type: cross Abstract: Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exace…

arXiv cs.LG TIER_1 English(EN) · Kun Zhang · 2026-07-01 12:23

MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exacerbated by two fundamental properties in the video …
