A new dataset called FineWeb has been released, comprising 15 trillion tokens derived from 12 years of Common Crawl data. The dataset has undergone extensive deduplication and filtering. The goal of FineWeb is to provide a cleaner, more refined corpus for training large language models.
Summary written by gemini-2.5-flash-lite from 1 source.