PulseAugur

TokTalk generates facial animation from audio tokens

Researchers have developed TokTalk, a system that generates expressive, real-time facial animation directly from the audio tokens produced by Audio-LLMs such as GPT-4o. Driving the animation from these tokens bypasses the traditional sequential pipeline of speech recognition and synthesis, with the aim of more natural and responsive avatar performances. TokTalk combines a new dataset with a Chunk-based Conditional Flow Matching model, and the authors report competitive latency and superior quality in perceptual studies.

IMPACT Could enable more natural and responsive avatar performances by driving facial animation directly from Audio-LLM tokens.

RANK_REASON The cluster contains a research paper detailing a new system for generating facial animations.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CV TIER_1 English(EN) · Qingcheng Zhao, Yifang Pan, Karan Singh

    TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

    arXiv:2605.31294v1. Abstract: Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars however, still seem robotic in facial expression and conversational flow, in part due to seq…

  2. arXiv cs.CV TIER_1 English(EN) · Karan Singh

    TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

    Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text gener…