Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)
Researchers have fine-tuned OpenAI's Whisper model for Swiss German automatic speech recognition (ASR), using Standard German subtitles as weak supervision. The model reaches a 25.6% word error rate (WER) on a test set that is strictly disjoint from the training data. A harmonized error analysis yields a content WER (cWER) of 13.8%, suggesting that the error rate on actual content is substantially lower than the raw WER indicates. The study also found that previously reported state-of-the-art results for Swiss German ASR were inflated by benchmark contamination: a vanilla Whisper model, with no Swiss German-specific training, achieved a lower WER.
AI IMPACT: Highlights the potential for improved ASR in low-resource languages and the need for rigorous benchmark evaluation.
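To illustrate why a raw WER and a convention-harmonized WER can diverge, here is a minimal Python sketch. WER is computed in the standard way, as word-level edit distance divided by the number of reference words. The `normalize` function (lowercasing and punctuation stripping) is an assumption made for this example only; the summary does not say how the study harmonizes transcripts or defines cWER, and the sample sentences are invented.

```python
import re


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def normalize(text):
    """Hypothetical harmonization step: lowercase and drop punctuation.

    This is an illustrative assumption, not the study's cWER procedure.
    """
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


# Invented example: the hypothesis differs only in casing and punctuation.
reference = "Das Wetter ist heute schön."
hypothesis = "das wetter ist heute schön"

raw_wer = wer(reference, hypothesis)
harmonized_wer = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_wer:.2f}, harmonized WER: {harmonized_wer:.2f}")
```

In this toy case every counted error comes from a formatting convention, so the harmonized score drops to zero while the raw score stays high. Real convention mismatches, such as number formats or spelling variants, would need a richer normalization than the one assumed here.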