Pleias releases Common Corpus with 2 trillion tokens and provenance tracking

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Pleias has released Common Corpus via Hugging Face, a fully open multilingual dataset of more than 2 trillion tokens, as reported by Smol AINews. The corpus includes detailed provenance information, allowing users to trace the origin of the data. The release aims to provide a high-quality, transparent foundation for training large language models.

RANK_REASON Release of a large, open-source dataset for AI training.

COVERAGE [1]

Smol AINews TIER_1 · 2024-11-14 01:54

Common Corpus: 2T Open Tokens with Provenance

**Pleias** via **Hugging Face** released **Common Corpus**, the largest fully open multilingual dataset with over **2 trillion tokens** including detailed **provenance information**. They also introduced **OCRonos-Vintage**, a **124M-parameter OCR correction model** that efficient…
