Researchers have developed SomaliWeb v1, a new corpus of Somali text containing approximately 303 million tokens. The dataset was built with a reproducible six-stage pipeline that filters data from HPLT v2, CC100, and Somali Wikipedia. The release also includes a matched BPE-16K tokenizer and the first public benchmark for Somali language identification, highlighting quality issues in existing datasets.
AI IMPACT Provides essential resources for developing AI models for the Somali language, addressing a gap in low-resource language support.
RANK_REASON The cluster describes a new academic paper detailing the creation of a specialized language corpus and associated tools.
AI-generated summary · Google Gemini · from 1 source.