PulseAugur

New dialogue system integrates real-time facial generation with speech

Researchers have developed Moshi-Face, a full-duplex spoken dialogue system that generates facial expressions alongside speech. The system uses a VQ-VAE to encode facial data into discrete tokens and a Face Transformer to generate those tokens non-autoregressively. The resulting model produces synchronized speech and facial expressions in real time, maintaining dialogue quality while achieving audiovisual alignment at low latency.
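
To make the "discrete tokens" idea concrete, here is a minimal sketch of the quantization step at the core of any VQ-VAE: each continuous feature vector is replaced by the index of its nearest codebook entry. This is not the Moshi-Face implementation. The function names, the 2-D feature size, and the three-entry codebook are hypothetical and chosen only for illustration; a real system would learn the codebook and use an encoder and decoder network around this step.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical codebook and per-frame facial features (assumed 2-D for brevity).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(frames, codebook)
reconstructed = dequantize(tokens, codebook)
```

Once facial frames are reduced to token indices like these, a sequence model can predict them much as it predicts audio tokens, which is what makes a shared token-based pipeline for speech and face plausible.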

IMPACT Enables more natural and expressive human-computer interactions by synchronizing speech with facial movements.

RANK_REASON Research paper detailing a new model release.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL · TIER_1 · English (EN) · Ryuichiro Higashinaka

    Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

    Full-duplex spoken dialogue models, such as Moshi, enable natural, low-latency voice conversations. However, they remain limited to the audio modality, lacking the facial expressions that are integral to human communication. We present Moshi-Face, the first full-duplex dialogue m…