Researchers have released SomaliWeb v1, a new corpus of Somali text containing approximately 303 million tokens. The dataset was built with a reproducible six-stage pipeline that filters data from HPLT v2, CC100, and Somali Wikipedia. The release also includes a matched BPE-16K tokenizer and the first public benchmark for Somali language identification, which highlights quality issues in existing datasets.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Provides essential resources for developing AI models for the Somali language, addressing a gap in low-resource language support.
RANK_REASON The cluster describes a new academic paper detailing the creation of a specialized language corpus and associated tools.