MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

MJEPA: A Unified Architecture for Audio-Visual Learning Unveiled

By the PulseAugur editorial team · [1 source] · 2026-06-23 22:48

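Researchers have introduced MJEPA, a new joint-embedding predictive architecture designed for audio-visual learning. The method processes both modalities with a single unified encoder and trains it with a single prediction objective applied both across and within modalities, which simplifies the learning setup. The study finds that cross-modal prediction is critical to performance and that MJEPA's representations benefit from cross-modal learning. MJEPA surpasses previous frozen baselines on AudioSet-20K and achieves competitive performance on other benchmarks while using substantially less video data.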
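The idea of one shared encoder trained with one prediction loss, applied both within a modality and across modalities, can be illustrated with a toy sketch. This is not the paper's implementation: the embedding width `DIM`, the linear-plus-tanh encoder, the mean-pooling "predictor", and the random token vectors are all hypothetical stand-ins chosen only to keep the example self-contained. JEPA-style methods typically use a learned predictor and a separate target encoder, neither of which is modeled here.

```python
import math
import random

DIM = 4  # hypothetical embedding width, chosen only for illustration


def make_encoder(seed):
    """Build one shared toy encoder (linear map + tanh) used for BOTH modalities."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

    def encode(token):
        return [math.tanh(sum(w * x for w, x in zip(row, token))) for row in weights]

    return encode


def mean_pool(vectors):
    """Average a list of embeddings into a single embedding."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def prediction_loss(encode, context_tokens, target_tokens):
    """Mean squared error between a predicted embedding and each target embedding.

    Assumption: mean pooling of the context embeddings stands in for a
    learned predictor network.
    """
    predicted = mean_pool([encode(tok) for tok in context_tokens])
    total = 0.0
    for tok in target_tokens:
        target = encode(tok)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target)) / DIM
    return total / len(target_tokens)


# Random vectors standing in for audio and video patch tokens (hypothetical data).
rng = random.Random(0)
audio = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
video = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]

encode = make_encoder(seed=1)

# The same loss function is reused for both kinds of prediction.
within_audio = prediction_loss(encode, audio[:3], audio[3:])  # audio context -> audio targets
cross_modal = prediction_loss(encode, video, audio)           # video context -> audio targets
total_loss = within_audio + cross_modal

print("within-modal loss:", within_audio)
print("cross-modal loss:", cross_modal)
print("total loss:", total_loss)
```

The point of the sketch is structural: `prediction_loss` and `encode` are shared, so the within-modal and cross-modal terms differ only in which tokens serve as context and which as targets.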

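Impact: Introduces a unified architecture for audio-visual learning that could simplify and improve cross-modal representation learning.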

Ranking rationale: This item describes a new research paper introducing a novel audio-visual learning architecture.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

Hugging Face Daily Papers · 2026-06-23 22:48

MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it…
