New research offers unified control for audio-video generation

By PulseAugur Editorial · [3 sources] · 2026-06-29 00:00

Two new research papers introduce advanced methods for generating synchronized audio and video. MMControl focuses on unified multi-modal control, allowing users to influence character identity, voice, pose, and scene layout using various visual and acoustic signals. Unison aims to harmonize motion, speech, and sound by decoupling speech and sound effect generation and employing cross-modal synchronization strategies to improve coherence and reduce mismatches.

IMPACT These advancements could lead to more sophisticated and controllable AI-generated video content, impacting creative industries and synthetic media.

RANK_REASON Two research papers published on arXiv detailing new methods for audio-video generation.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-29 00:00

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with shared encoder-decoder and modal-specific queries to create compact one-dimensional latent representations.

arXiv cs.CV TIER_1 English(EN) · Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen · 2026-06-30 04:00

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

arXiv:2604.19679v3 Announce Type: replace Abstract: Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are ty…

arXiv cs.CV TIER_1 English(EN) · Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu · 2026-06-30 04:00

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

arXiv:2605.08729v2 Announce Type: replace Abstract: Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to …
