New LLM reportedly trained on over 100 trillion tokens

By PulseAugur Editorial · [1 source] · 2026-06-01 04:38

A new large language model is reportedly being trained on more than 100 trillion tokens, well above the 27-50 trillion tokens typically used for current models. A dataset of this size implies a substantial increase in the compute required for training. The model, possibly named M3, is speculated to have fewer than 500 billion parameters despite the volume of training data.
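
To put the reported figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not a statement about the actual model. It assumes the common rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per token, and it treats the speculated 500 billion parameters as an upper bound. The real parameter count, architecture (for example, a mixture-of-experts design with fewer active parameters), and training setup are not known from the source, and all names in the code are illustrative.

```python
# Back-of-the-envelope estimate only; every constant below is either a
# figure reported in the summary or an explicitly labeled assumption.

TOKENS_REPORTED = 100e12       # "over 100 trillion tokens" (reported)
TOKENS_TYPICAL_LOW = 27e12     # low end of the "typical" range in the summary
TOKENS_TYPICAL_HIGH = 50e12    # high end of the "typical" range in the summary
PARAMS_UPPER_BOUND = 500e9     # ASSUMPTION: speculated ceiling, not confirmed


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the ~6 * N * D rule of thumb.

    ASSUMPTION: dense transformer, forward plus backward pass costing
    about 6 FLOPs per parameter per token.
    """
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / params


if __name__ == "__main__":
    ratio = tokens_per_param(PARAMS_UPPER_BOUND, TOKENS_REPORTED)
    flops = training_flops(PARAMS_UPPER_BOUND, TOKENS_REPORTED)
    print(f"tokens per parameter (at the assumed ceiling): {ratio:.0f}")
    print(f"estimated training compute: {flops:.1e} FLOPs")
    # Compute scales linearly with tokens at a fixed parameter count.
    print(f"vs 50T tokens: {TOKENS_REPORTED / TOKENS_TYPICAL_HIGH:.1f}x")
    print(f"vs 27T tokens: {TOKENS_REPORTED / TOKENS_TYPICAL_LOW:.1f}x")
```

Under these assumptions the ratio comes to 200 tokens per parameter and the estimate to roughly 3e26 FLOPs, or about 2 to 3.7 times the compute needed to train the same model on 50 or 27 trillion tokens. A smaller parameter count would lower the compute estimate and raise the tokens-per-parameter ratio proportionally.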

IMPACT A dataset of this size could mark a new frontier in LLM pretraining and may lead to more capable models, provided the computational challenges are met.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/True_Requirement_891 · 2026-06-01 04:38

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on.

https://www.reddit.com/r/LocalLLaMA/comments/1tthnru/100_trillion_pretraining_data_this_is_the_largest/
