Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
Researchers have developed a two-stage framework for visual automatic speech recognition (V-ASR) that aims to improve accuracy by predicting phonemes as an intermediate step rather than predicting words directly. The first stage fuses visual cues with facial landmark motion features to predict phonemes; the second uses a large language model (LLM) called NLLB to reconstruct words from the predicted phoneme sequence. The approach reportedly achieves a word error rate of 17.4% on the LRS2 dataset and 21.0% on LRS3, outperforming previous methods that struggled with viseme ambiguity, where different phonemes produce nearly identical lip shapes.
IMPACT This phoneme-based approach could lead to more robust speech recognition systems, particularly in noisy environments or for individuals with speech impediments.
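For readers unfamiliar with the metric behind the LRS2 and LRS3 figures, word error rate is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name and the example sentences are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); not the paper's evaluation script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.1%}")
```

Under this definition, a 17.4% word error rate corresponds to roughly one word-level error for every six reference words.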