CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Researchers have introduced CommonLID, a benchmark for language identification (LID) designed specifically for web data. The benchmark includes human annotations for 109 languages and targets the poor performance of existing models on noisy web text, particularly for under-served languages. Evaluations on CommonLID show that accuracy reported on existing benchmarks often overstates how well current LID models perform on web data, which highlights the need for more robust evaluation methods and datasets.
AI IMPACT: Highlights limitations in current language identification models, a component crucial to multilingual AI development and data curation.