New MuVAP Framework Predicts Turn-Taking in Multiparty Conversations

By PulseAugur Editorial · [2 sources] · 2026-06-15 13:54

Researchers have introduced MuVAP, a multimodal framework for predicting turn-taking in multiparty conversations. The system extends Voice Activity Projection by combining acoustic predictions with face tracking, using only a single camera and a monaural audio stream, which makes it suitable for human-robot interaction. To handle multiple speakers, MuVAP employs Role-Relative Projection. The framework is validated on the newly created Audio-Visual Conversation Corpus, a 31-hour dataset of unedited conversations, and outperforms existing baselines on turn-taking prediction tasks.

IMPACT This framework could enhance human-robot interaction by enabling more natural turn-taking in conversations.

RANK_REASON The cluster describes a new research paper published on arXiv detailing a novel framework and dataset for conversational AI.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Haotian Qi, Gabriel Skantze · 2026-06-16 04:00

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

arXiv:2606.16731v1 Announce Type: cross Abstract: Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extend…

arXiv cs.AI TIER_1 English(EN) · Gabriel Skantze · 2026-06-15 13:54

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic …
