PulseAugur

OpenLID-v3 enhances language identification for closely related languages

Researchers have developed OpenLID-v3, an updated language identification system that aims to distinguish closely related languages more accurately and to filter noise out of web data. The new version adds training data, merges problematic clusters of language variants, and introduces a dedicated label for noise. Evaluations against existing tools such as GlotLID on several benchmarks, focusing on the Slavic, Romance, and Scandinavian language groups, indicate that ensemble approaches improve precision but can reduce coverage for low-resource languages. The OpenLID-v3 system and its associated datasets are publicly available.
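The precision and coverage trade-off mentioned above can be illustrated with a toy sketch. The summary does not say how the ensemble is built, so the agreement rule below (answer only when all classifiers agree, otherwise abstain) is an assumption, and the two classifiers, sample IDs, and predictions are hypothetical stand-ins rather than OpenLID-v3 or GlotLID output.

```python
from typing import Callable, Optional, Sequence

# A classifier maps a text to a language label (here: ISO 639-3 codes).
Classifier = Callable[[str], str]


def agreement_ensemble(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return a label only if every classifier agrees; otherwise abstain (None)."""
    labels = {clf(text) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def precision_and_coverage(predictions, gold):
    """Precision over answered items, and the share of items that got an answer."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    precision = sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    return precision, coverage


# Hypothetical data: four samples in closely related Scandinavian languages.
samples = ["s1", "s2", "s3", "s4"]
gold = ["nob", "nno", "dan", "nob"]

# Hypothetical classifiers, modelled as fixed lookup tables; each makes one error.
toy_a: Classifier = {"s1": "nob", "s2": "nno", "s3": "nob", "s4": "nob"}.__getitem__
toy_b: Classifier = {"s1": "nob", "s2": "nno", "s3": "dan", "s4": "dan"}.__getitem__

single_a = precision_and_coverage([toy_a(s) for s in samples], gold)
single_b = precision_and_coverage([toy_b(s) for s in samples], gold)
ensemble = precision_and_coverage(
    [agreement_ensemble(s, [toy_a, toy_b]) for s in samples], gold
)
```

In this toy setup each classifier alone labels every sample but gets one wrong, while the ensemble answers only where both agree. It makes no errors on the samples it answers, but it leaves half of them unlabeled, which is the kind of coverage loss the summary describes.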



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

    OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

    arXiv:2602.13139v3. Abstract: Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to disting…