PulseAugur

Zero-shot voice cloning enhances dysarthric ASR model training

Researchers explored zero-shot voice cloning as a way to augment training data for automatic speech recognition (ASR) on dysarthric speech. They cloned speakers from the TORGO dataset with Higgs Audio V2 and used the cloned speech to fine-tune the Whisper-medium model. The approach reached a word error rate (WER) of 26.00%, which is competitive with models trained on real or hybrid data, and it outperformed real-data training for speakers with moderate to severe dysarthria. The findings suggest that zero-shot cloning offers a scalable answer to the data scarcity problem in dysarthric ASR.
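For readers unfamiliar with the metric, the sketch below shows how a word error rate such as the 26.00% quoted above is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the summary does not say which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; the normalization here
    (lowercasing, whitespace splitting) is an assumption, not the paper's.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("turn on the light", "turn the light"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% on very poor transcripts. Reported figures are usually aggregated over a whole test set (total edits divided by total reference words) rather than averaged per utterance.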

IMPACT This research offers a scalable method to improve ASR for dysarthric speech, potentially increasing accessibility and usability of voice-enabled technologies for individuals with speech impairments.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Mark Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

    Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

    arXiv:2606.19823v1 (cross-listed). Abstract: Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specif…

  2. arXiv cs.LG TIER_1 English(EN) · Seyed Reza Shahamiri

    Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

    Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. …