PulseAugur

Pleias releases Common Corpus, a 2-trillion-token open dataset with provenance tracking

Pleias has released Common Corpus via Hugging Face, as reported by Smol AINews. The fully open multilingual dataset comprises more than 2 trillion tokens and includes detailed provenance information, so users can trace where the data came from. The release aims to provide a high-quality, transparent foundation for training large language models.

Summary written by gemini-2.5-flash-lite from 1 source.

RANK_REASON Release of a large, open-source dataset for AI training.

COVERAGE [1]

  1. Smol AINews (TIER_1)

    Common Corpus: 2T Open Tokens with Provenance

    **Pleias**, via **Hugging Face**, released **Common Corpus**, the largest fully open multilingual dataset, with over **2 trillion tokens** and detailed **provenance information**. They also introduced **OCRonos-Vintage**, a **124M-parameter OCR correction model** that efficient…