A new paper highlights significant compatibility problems with the Creative Commons licenses used in African NLP corpora. The research finds that common licenses such as CC-BY-SA and CC-BY-NC are often incompatible when combined, and that clauses like NoDerivs can prohibit essential data processing steps such as tokenization and annotation. The study details four failure modes: outright prohibition, misrepresentation of composite licenses, hidden NoDerivs clauses, and data persistence failures. Affected corpora include JW300, WAXAL, Tanzil, and the Congolese Radio Corpus.
IMPACT Highlights critical data licensing challenges that could hinder the development of low-resource African language NLP models.
RANK_REASON Academic paper analyzing data licensing issues.
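
The kind of license conflict the summary describes can be sketched as a small compatibility check. This is a minimal illustration, not the paper's method: the function names (`elements`, `can_remix`, `conflicts`) are hypothetical, license versions are ignored, and the rules are a simplified reading of the Creative Commons remix conditions (NoDerivs blocks adaptation, and ShareAlike requires the combined work to carry the same license).

```python
from itertools import combinations

KNOWN_ELEMENTS = {"BY", "NC", "SA", "ND"}


def elements(license_id):
    """Extract license elements, e.g. "CC-BY-NC-SA" -> {"BY", "NC", "SA"}.

    Assumption: version suffixes such as "4.0" are ignored in this sketch.
    """
    parts = license_id.upper().split("-")
    return frozenset(p for p in parts if p in KNOWN_ELEMENTS)


def can_remix(a, b):
    """Return True if works under licenses a and b can be combined
    into one adapted work, under the simplified rules described above."""
    ea, eb = elements(a), elements(b)
    # NoDerivs forbids sharing adaptations at all.
    if "ND" in ea or "ND" in eb:
        return False
    sa_a, sa_b = "SA" in ea, "SA" in eb
    # Two ShareAlike licenses must be the same license.
    if sa_a and sa_b:
        return ea == eb
    # One ShareAlike license: the result must carry it, so the other
    # license may not impose a restriction the ShareAlike one lacks.
    if sa_a:
        return eb <= ea
    if sa_b:
        return ea <= eb
    return True


def conflicts(licenses):
    """List every pair of licenses in a composite corpus that cannot be combined."""
    return [(a, b) for a, b in combinations(licenses, 2) if not can_remix(a, b)]


if __name__ == "__main__":
    # Hypothetical composite corpus built from three differently licensed sources.
    print(conflicts(["CC-BY", "CC-BY-SA", "CC-BY-NC"]))
```

Under these assumptions, a corpus that mixes CC-BY-SA and CC-BY-NC sources is flagged, because the ShareAlike output license cannot carry the NonCommercial restriction, while any source with a NoDerivs clause conflicts with every other source.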
- Congolese Radio Corpus
- Creative Commons
- Creative Commons Attribution-NonCommercial
- Creative Commons Attribution-ShareAlike
- Ernst van Gassen
- Hugging Face
- Kituba
- Opus
- WAXAL
- Zarma
AI-generated summary · Google Gemini · from 1 source.