Smol AI has released Common Corpus, a dataset of 2 trillion tokens derived from open-source data. The corpus is notable for its detailed provenance tracking, which lets users trace where each piece of data came from. The release aims to provide a high-quality, transparent foundation for training large language models.
Summary written by gemini-2.5-flash-lite from 1 source.