PulseAugur

LLMs now trained on AI-generated data, revealing complex model dependencies

Large language models are increasingly trained on data that other AI models generate and filter, rather than on human-created data alone. Other models are also used to evaluate outputs and guide development work. The resulting dependency chains are large: Olmo 3 has 89 model and 183 dataset dependencies, and Nemotron 3 has 273 model and 560 dataset dependencies. To trace these dependencies, the team behind the source post (ai2.bsky.social) built a tool called ModSleuth.
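To make the idea of "model and dataset dependencies" concrete, here is a minimal Python sketch of counting everything a model depends on, directly or indirectly, in a toy dependency graph. It is an illustration only: the graph, the `model:` / `dataset:` naming scheme, and both helper functions are hypothetical and do not describe how ModSleuth works or what Olmo 3 or Nemotron 3 actually depend on.

```python
from collections import deque

# Hypothetical toy graph: each artifact maps to the artifacts it directly
# depends on. All names are made up for illustration.
DEPENDS_ON = {
    "model:target": ["dataset:synthetic-a", "model:judge"],
    "dataset:synthetic-a": ["model:generator", "dataset:web-crawl"],
    "model:judge": ["dataset:preferences"],
    "model:generator": ["dataset:web-crawl"],
}


def transitive_dependencies(root, graph):
    """Return every artifact reachable from root, excluding root itself."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            # Skip artifacts already visited so shared or cyclic
            # dependencies are counted once.
            if dep != root and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def count_by_kind(deps):
    """Split a set of artifact names into (model count, dataset count)."""
    models = sum(1 for d in deps if d.startswith("model:"))
    datasets = sum(1 for d in deps if d.startswith("dataset:"))
    return models, datasets


deps = transitive_dependencies("model:target", DEPENDS_ON)
print(count_by_kind(deps))  # (2, 3)
```

In this toy graph the target model has 2 model and 3 dataset dependencies. The shared `dataset:web-crawl` is counted once even though two artifacts use it, which is the kind of deduplication any such count would need.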

IMPACT Highlights the growing reliance on synthetic data and model-to-model dependencies in LLM development, with implications for the transparency of how models are built.

RANK_REASON The cluster discusses a new method for tracing data dependencies in LLMs, which is a research-oriented topic.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Bluesky Jetstream — AI desk TIER_1 English(EN) · ai2.bsky.social

    LLMs are no longer created w/ human data alone. They rely on other models to generate & filter data, evaluate outputs, & guide dev work. So what is a modern LLM built on? Olmo 3 → 89 model + 183 dataset dependencies; Nemotron 3 → 273 + 560. We made ModSleuth to trace this. 🧵