Researchers have developed MoCoTalk, a novel diffusion model designed for generating controllable talking head videos. This framework integrates four distinct conditional inputs: a reference image, facial keypoints, 3D facial models, and speech audio. To manage the interplay between these varied conditions, an adaptive router dynamically adjusts feature fusion based on the noise level and feature subspace. The model also introduces a mouth-augmented shading mesh for improved geometric consistency and a lip-sync loss for tighter audio-visual alignment, achieving state-of-the-art results.
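The paper's actual router architecture is not detailed in this summary, so the sketch below is only an illustration of the general idea: a small gating network that predicts per-condition fusion weights from the diffusion timestep (noise level) together with a pooled summary of the condition features. All class names, shapes, and design choices here are assumptions, not MoCoTalk's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveConditionRouter(nn.Module):
    """Hypothetical noise-aware router that fuses multiple condition
    streams (e.g. reference image, keypoints, 3D mesh, audio).
    Illustrative only; not MoCoTalk's actual design."""

    def __init__(self, dim: int, num_conditions: int = 4):
        super().__init__()
        # Embed the scalar diffusion timestep (noise level) into a vector.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        # Predict one fusion weight per condition stream from the noise
        # embedding concatenated with a pooled feature summary (a crude
        # stand-in for the "feature subspace" signal mentioned above).
        self.gate = nn.Linear(2 * dim, num_conditions)

    def forward(self, cond_feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # cond_feats: (batch, num_conditions, seq, dim)
        # t:          (batch,) diffusion timesteps, scaled to [0, 1]
        noise_emb = self.time_mlp(t[:, None])          # (B, dim)
        summary = cond_feats.mean(dim=(1, 2))          # (B, dim)
        weights = torch.softmax(
            self.gate(torch.cat([noise_emb, summary], dim=-1)), dim=-1
        )                                              # (B, num_conditions)
        # Weighted sum over the condition axis yields one fused stream.
        return (weights[:, :, None, None] * cond_feats).sum(dim=1)

# Usage: four condition streams of 77 tokens each, fused per noise level.
router = AdaptiveConditionRouter(dim=256)
fused = router(torch.randn(2, 4, 77, 256), torch.rand(2))  # (2, 77, 256)
```

The intuition a scheme like this captures is that different conditions matter at different denoising stages: coarse structure (geometry, pose) early at high noise, fine lip detail (audio) late at low noise.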
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Introduces a new method for controllable video generation, with potential implications for synthetic media and virtual avatars.
RANK_REASON: Publication of a new academic paper on arXiv detailing a novel AI model.