PulseAugur

New Research Brings Unified Control to Audio-Video Generation

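Two new research papers introduce advanced methods for generating audio and video together. MMControl focuses on unified multimodal control, letting users steer character identity, voice, pose, and scene layout with a range of visual and audio signals. Unison aims to coordinate motion, speech, and sound by decoupling speech generation from sound-effect generation and applying a cross-modal synchronization strategy to improve coherence and reduce mismatches.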

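Impact: These advances could lead to more sophisticated and more controllable AI-generated video content, with implications for the creative industries and synthetic media.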

Ranking rationale: Two research papers published on arXiv detail new methods for audio-video generation.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

    AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with shared encoder-decoder and modal-specific queries to create compact one-dimensional latent representations.

  2. arXiv cs.CV TIER_1 English(EN) · Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen ·

    MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

    arXiv:2604.19679v3 Announce Type: replace Abstract: Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are ty…

  3. arXiv cs.CV TIER_1 English(EN) · Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu ·

    Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    arXiv:2605.08729v2 Announce Type: replace Abstract: Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to …