dolma
PulseAugur coverage of Dolma: every cluster mentioning Dolma across labs, papers, and developer communities, ranked by signal.
-
New framework analyzes narrative structures in LLM pretraining data
Researchers have developed a new framework and model, NarraBERT, to analyze narrative structures within large language model (LLM) pretraining data. This analysis, applied to the 3-trillion-token Dolma corpus, reveals m…
-
New framework analyzes narrative structure in LLM pretraining data · 4 sources tracked
Researchers have developed a new framework and model, NarraBERT, to analyze narrative structures within large language model (LLM) pretraining data. The study applied this framework to the 3-trillion-token Dolma corpus,…
-
LLM popularity bias driven by pretraining data exposure, study finds
Researchers have analyzed how large language models (LLMs) develop preferences for well-known entities, a phenomenon often linked to popularity bias. Using the open OLMo models and their complete Dolma pretraining corpu…