Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming
Researchers have developed a method for multilingual word-level forced alignment that combines representations from the Massively Multilingual Speech (MMS) model with a self-supervised phoneme boundary detector. A learned dynamic programming decoder then infers word boundaries from these features. The system outperformed existing methods such as the Montreal Forced Aligner (MFA) on the TIMIT and Buckeye datasets and showed promising results on unseen languages, suggesting it could scale to the more than 1,100 languages supported by MMS.
IMPACT More accurate word-level alignment across many languages could improve cross-lingual speech applications.
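
To make the dynamic programming step in the summary concrete, here is a minimal sketch of monotonic frame-to-word alignment. It is not the paper's learned decoder: the function name `align_words` and the `scores` matrix (one row per audio frame, one column per word, higher meaning a better match) are assumptions made for illustration, and the scores are taken as given rather than learned from MMS features.

```python
def align_words(scores):
    """Align frames to words in order, maximizing the summed score.

    scores[t][w] is a hypothetical score that frame t belongs to word w.
    Returns one (start_frame, end_frame) pair per word, end exclusive.
    """
    T, W = len(scores), len(scores[0])
    if T < W:
        raise ValueError("need at least one frame per word")
    NEG = float("-inf")
    # best[t][w]: best total score with frame t assigned to word w
    best = [[NEG] * W for _ in range(T)]
    # advanced[t][w]: True if word w starts at frame t on the best path
    advanced = [[False] * W for _ in range(T)]
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for w in range(min(t + 1, W)):
            stay = best[t - 1][w]                          # same word continues
            move = best[t - 1][w - 1] if w > 0 else NEG    # next word begins
            if move > stay:
                best[t][w] = move + scores[t][w]
                advanced[t][w] = True
            else:
                best[t][w] = stay + scores[t][w]
    # Backtrace from the last frame of the last word to recover boundaries.
    segments, w, end = [], W - 1, T
    for t in range(T - 1, -1, -1):
        if advanced[t][w]:
            segments.append((t, end))
            end = t
            w -= 1
    segments.append((0, end))
    return segments[::-1]
```

Multiplying each returned frame index by the frame duration gives word boundaries in seconds. In the system described above, the quantities fed to the decoder would come from the MMS representations and the phoneme boundary detector instead of a fixed score matrix.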