PulseAugur

Google releases WAXAL dataset for 27 African languages, AfriVoices-KE adds Kenyan languages

Google Research has released WAXAL, a large-scale, open-access speech dataset covering 27 African languages, intended to help close the digital divide in speech technology. The dataset includes approximately 1,846 hours of audio for automatic speech recognition (ASR) and over 565 hours for text-to-speech (TTS), collected in collaboration with African academic and community organizations. Separately, a new dataset called AfriVoices-KE has been published, featuring around 3,000 hours of audio across five Kenyan languages, with a mix of scripted and spontaneous speech. Both initiatives aim to support the development of inclusive voice-enabled technologies and to preserve linguistic heritage.

IMPACT These datasets are foundational for developing inclusive speech technologies and preserving linguistic diversity in underrepresented regions.

RANK_REASON The cluster describes the release of large-scale speech datasets for African languages, which constitutes a research milestone in AI.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. Google AI / Research TIER_1 English(EN)

    WAXAL: A large-scale open resource for African language speech technology

    Natural Language Processing

  2. arXiv cs.CL TIER_1 English(EN) · Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo, Leila Misula, Vivian Oloo, Rennish Mboya, Edwin Onkoba, Edward Ombui, Joseph Muguro, Ciira wa Maina, Andrew Kipkebut, Alfred Omondi Otom, Ian Ndung'u Kang'ethe, Angela Wambui Kany… ·

    AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

    arXiv:2604.08448v2. Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted spee…