PulseAugur

User creates 103B-token Usenet corpus for AI training

A user has compiled a 103-billion-token Usenet corpus spanning 1980 to 2013 and presents it as valuable AI training data because its content is pre-web, human-written, and free of AI contamination. The corpus is organized by Usenet hierarchy, including computing, science, and recreation, and so avoids modern web artifacts and AI-generated text. A Gemma 4 model fine-tuned on a sample of the data is offered as a demonstration of its potential; samples are available for download, and the full corpus is open for licensing.

IMPACT Provides a unique, AI-contamination-free dataset for fine-tuning models, potentially improving their raw human-like writing capabilities.

RANK_REASON User-created dataset release for AI training.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/OwnerByDane

    I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

    Posted this to r/MachineLearning a couple weeks ago (30K views, 100+ upvotes) and have been meaning to share it here where the fine-tuning angle is more directly relevant.

    I spent years building and process…