New dialogue system integrates real-time facial generation with speech

By PulseAugur Editorial · [1 source] · 2026-06-20 09:59

Researchers have developed Moshi-Face, a full-duplex spoken dialogue system that generates facial expressions alongside speech. The system uses a VQ-VAE to encode facial data into discrete tokens and a Face Transformer to generate those tokens non-autoregressively. The resulting model produces synchronized speech and facial expressions in real time, maintaining dialogue quality while achieving audiovisual alignment at low latency.
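
To make the idea of discrete facial tokens concrete, here is a minimal sketch of the vector-quantization step that gives a VQ-VAE its name: each continuous feature vector is replaced by the index of its nearest codebook entry. This is not the Moshi-Face implementation. The 2-D feature vectors, the four-entry codebook, and the function names are all hypothetical, and a real VQ-VAE learns both its encoder and its codebook from data.

```python
# Illustrative only: a toy nearest-neighbour quantizer, not the paper's code.
# The codebook values and 2-D "facial feature" frames are made-up assumptions.

def quantize(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def encode(frames, codebook):
    """Map a sequence of continuous frames to a sequence of discrete tokens."""
    return [quantize(frame, codebook) for frame in frames]

def decode(tokens, codebook):
    """Map tokens back to their codebook vectors (a lossy reconstruction)."""
    return [codebook[token] for token in tokens]

# Hypothetical 4-entry codebook over 2-D features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-frame facial features, e.g. two expression coefficients.
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]

tokens = encode(frames, codebook)
reconstructed = decode(tokens, codebook)
print(tokens)
```

Once faces are represented as token indices like these, a sequence model can predict them in the same way it predicts other discrete symbols, which is what allows facial generation to sit alongside audio tokens in one system.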

IMPACT Enables more natural and expressive human-computer interactions by synchronizing speech with facial movements.

RANK_REASON Research paper detailing a new model release.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Ryuichiro Higashinaka · 2026-06-20 09:59

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

Full-duplex spoken dialogue models, such as Moshi, enable natural, low-latency voice conversations. However, they remain limited to the audio modality, lacking the facial expressions that are integral to human communication. We present Moshi-Face, the first full-duplex dialogue m…

