A new dataset called FineWeb has been released, comprising 15 trillion tokens derived from 12 years of Common Crawl data. The dataset has undergone extensive deduplication and filtering. The goal of FineWeb is to provide a cleaner, more refined corpus for training large language models.
Summary written by gemini-2.5-flash-lite from 1 source.