New V-ASR system uses phoneme prediction and LLM for improved accuracy

By PulseAugur Editorial · [1 source] · 2026-06-02 04:00

Researchers have proposed a two-stage framework for visual automatic speech recognition (V-ASR) that predicts phonemes instead of predicting words directly. The first stage fuses visual features with facial landmark motion features to predict a phoneme sequence; the second stage uses a large language model (LLM), NLLB, to reconstruct words from those phonemes. The approach reportedly achieves a word error rate (WER) of 17.4% on the LRS2 dataset and 21.0% on LRS3, outperforming previous methods that struggled with viseme ambiguity, where different phonemes look alike on the lips.
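
For readers unfamiliar with the metric, word error rate counts the word-level substitutions, deletions, and insertions needed to turn a system's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the example sentences are made up for this illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: "mat" read as "bat" (the two look alike on the lips)
# plus one dropped word gives 2 errors over 6 reference words.
wer = word_error_rate("place the mat by the door", "place the bat by door")
print(f"WER = {wer:.1%}")
```

Under this definition, a WER of 17.4% corresponds to roughly 17 word errors per 100 reference words.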

IMPACT This phoneme-based approach could lead to more robust speech recognition systems, particularly in noisy environments or for individuals with speech impediments.

RANK_REASON The cluster contains an academic paper detailing a new method for visual speech recognition.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh · 2026-06-02 04:00

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

arXiv:2507.18863v2 · Abstract: Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging du…
