TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens
Researchers have developed TokTalk, a system that generates expressive facial animation in real time directly from the audio tokens produced by audio large language models (audio-LLMs). This approach bypasses traditional multi-stage processes such as speech recognition and speech synthesis, with the aim of producing more natural and responsive avatar performances. TokTalk combines a new dataset with a Chunk-based Conditional Flow Matching model, and demonstrates competitive latency and superior quality in perceptual studies.
AI IMPACT Enables more natural and responsive avatar performances by driving facial animation directly from audio-LLM tokens.
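
For readers unfamiliar with the technique named above, the following is a minimal sketch of generic conditional flow matching applied chunk by chunk. It is not TokTalk's implementation: the summary gives no architectural details, so every name here (`cfm_pair`, `euler_sample`, `animate_in_chunks`), the linear interpolation path, the Euler integrator, the chunk size, and the toy velocity field are assumptions chosen for illustration. A real system would replace the toy velocity field with a trained network conditioned on audio-LLM tokens.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear-path flow matching (assumed path): return the point on the
    straight line from noise x0 to data x1 at time t, plus the velocity
    a model would be trained to regress at that point."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, x0, cond, steps=8):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1
    with fixed-step Euler, starting from noise x0."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def animate_in_chunks(tokens, chunk_size, velocity_fn, motion_dim, rng):
    """Generate motion frames one chunk at a time, so output can begin
    before the whole token stream has arrived (hypothetical scheme)."""
    frames = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        noise = rng.standard_normal((len(chunk), motion_dim))
        frames.append(euler_sample(velocity_fn, noise, chunk))
    return np.concatenate(frames, axis=0)

# Toy stand-in for a trained model: the exact velocity field that
# transports any point to the conditioning vector along a straight line.
def toy_velocity(x, t, cond):
    return (cond - x) / (1.0 - t)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 4))  # 10 made-up token embeddings, dim 4
motion = animate_in_chunks(tokens, chunk_size=4, velocity_fn=toy_velocity,
                           motion_dim=4, rng=rng)
print(motion.shape)
```

In this sketch each chunk is sampled independently; a production system would presumably also carry context across chunk boundaries to keep the motion continuous, which the summary does not describe.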