PulseAugur

MoVA framework strengthens video-text alignment with asymmetric dual projections

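Researchers have introduced MoVA, a new framework that aims to improve video-text alignment by addressing temporal misalignment and semantic asymmetry. MoVA learns asymmetric dual projections that let it adaptively select the relevant parts of a caption and disentangle text-relevant visual concepts from video frames. This lets the model preserve global cross-modal semantics while handling evolving, frame-specific concepts and scaling to long videos and captions, and it outperforms existing methods on alignment tasks.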
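As a rough illustration of the idea summarized above, the sketch below uses two separate (asymmetric) projections, one per modality, and lets each frame softly select the caption tokens most relevant to it. This is not MoVA's architecture, which the summary does not describe in detail. The names `align_score`, `W_video`, `W_text`, the temperature of 0.1, and the random features are all hypothetical and chosen only for illustration.

```python
import math
import random

def project(x, W):
    """Multiply a feature vector x (length d) by a d-by-k matrix W."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def normalize(v):
    """Scale v to unit length (left unchanged if its norm is zero)."""
    n = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / n for c in v]

def softmax(xs, temperature=0.1):
    """Numerically stable softmax with an assumed temperature."""
    m = max(xs)
    es = [math.exp((x - m) / temperature) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def align_score(frames, tokens, W_video, W_text):
    """Return (mean alignment score, per-frame token weights).

    Assumption: each modality has its own projection matrix, and each
    frame attends over caption tokens by cosine similarity.
    """
    v = [normalize(project(f, W_video)) for f in frames]
    t = [normalize(project(tok, W_text)) for tok in tokens]
    weights, per_frame = [], []
    for vf in v:
        sims = [sum(a * b for a, b in zip(vf, tt)) for tt in t]
        w = softmax(sims)  # soft selection of caption tokens for this frame
        weights.append(w)
        per_frame.append(sum(wi * si for wi, si in zip(w, sims)))
    return sum(per_frame) / len(per_frame), weights

# Toy inputs: 8 frames and 5 caption tokens with 16-dim random features.
random.seed(0)
d, k = 16, 4
frames = [[random.gauss(0, 1) for _ in range(d)] for _ in range(8)]
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
W_video = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
W_text = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
score, weights = align_score(frames, tokens, W_video, W_text)
```

In a trained model the two projection matrices would be learned, typically with a contrastive objective, rather than sampled at random as they are here.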

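Impact: This research could lead to more sophisticated AI systems that understand and generate content linking video and text more effectively.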

Ranking rationale: This is an academic paper detailing a new model for video-text alignment.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →


Sources [2]

  1. arXiv cs.LG TIER_1 English(EN) · Peiyuan Zhu, Shaoan Xie, Zijian Li, Yifan Shen, Namrata Deka, Harsh Shrivastava, Guangyi Chen, Kun Zhang

    MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

    arXiv:2607.00858v1 Announce Type: cross Abstract: Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exace…

  2. arXiv cs.LG TIER_1 English(EN) · Kun Zhang

    MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

    Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exacerbated by two fundamental properties in the video …