A user has compiled a 103-billion-token Usenet corpus spanning 1980 to 2013 and is promoting its value for training AI models: the text is human-written, predates large language models, and is described as having zero AI contamination. The corpus is organized by Usenet hierarchy, including computing, science, and recreation, and is free of modern web artifacts and AI-generated text. A Gemma 4 model fine-tuned on a sample of the data has already been used to demonstrate the corpus's potential. Samples are available for download, and the full corpus is open for licensing.
AI IMPACT Provides a dataset free of AI contamination for fine-tuning models, potentially making their writing read more like unedited human text.
RANK_REASON User-created dataset release for AI training.
AI-generated summary · Google Gemini · from 1 source.