Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.CL · English (EN) · 13h

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Researchers propose a two-stage framework for visual automatic speech recognition (V-ASR) that predicts phonemes first rather than predicting words directly. The first stage fuses visual cues with facial-landmark motion features to predict phonemes; the second uses the NLLB language model to reconstruct words from the phoneme sequence. The approach reportedly achieves a word error rate of 17.4% on the LRS2 dataset and 21.0% on LRS3, outperforming previous methods that struggled with viseme ambiguity (distinct phonemes that look alike on the lips).

IMPACT This phoneme-based approach could lead to more robust speech recognition systems, particularly in noisy environments or for individuals with speech impediments.
- Matthew Kit Khinn Teng
- NLLB
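
The results in this brief are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the number of reference words), and not the papers' own evaluation code: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition a WER of 17.4% corresponds to roughly 17 word-level errors per 100 reference words.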
TOOL · arXiv cs.CV · English (EN) · 13h

Head-Pose-Aware Visual Speech Recognition with FiLM Modulation

Researchers propose HP-VSR-ResFiLM, a visual speech recognition (VSR) framework that explicitly incorporates head-pose information. A pose-conditioned residual Feature-wise Linear Modulation (FiLM) block adapts the visual features to head orientation, addressing challenges such as geometric distortions and occlusions. On the LRS2 and LRS3 datasets the method reports competitive word error rates of 25.0% and 33.2%, respectively, reportedly demonstrating improved robustness in unconstrained VSR scenarios.

IMPACT Enhances robustness of speech recognition systems in real-world, unconstrained environments.
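
To make the idea of a pose-conditioned residual FiLM block concrete, here is a minimal sketch of generic FiLM with a residual connection. It is not the HP-VSR-ResFiLM implementation: the residual form `x + gamma * x + beta`, the linear mappings from pose to `gamma` and `beta`, the (yaw, pitch, roll) pose vector, and all function and parameter names are assumptions for illustration.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def residual_film(
    features: Sequence[float],
    pose: Sequence[float],
    w_gamma: Matrix,
    b_gamma: Sequence[float],
    w_beta: Matrix,
    b_beta: Sequence[float],
) -> List[float]:
    """Modulate each feature channel by a pose-dependent scale and shift.

    gamma and beta are predicted from the pose vector (assumed here to be
    yaw, pitch, roll), and the modulated features are added back to the
    input. With all-zero weights and biases the block is the identity.
    """
    gamma = linear(w_gamma, b_gamma, pose)
    beta = linear(w_beta, b_beta, pose)
    return [f + g * f + b for f, g, b in zip(features, gamma, beta)]


if __name__ == "__main__":
    features = [1.0, 2.0]    # two visual feature channels
    pose = [0.5, 0.0, 0.0]   # hypothetical (yaw, pitch, roll)
    out = residual_film(
        features, pose,
        w_gamma=[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], b_gamma=[0.0, 0.0],
        w_beta=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], b_beta=[0.0, 0.0],
    )
    print(out)
```

The residual form is a common design choice because the block starts out as a pass-through when the modulation parameters are near zero, so pose conditioning can only add an adjustment on top of the unmodified visual features.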
