New framework analyzes narrative structures in LLM pretraining data

By PulseAugur Editorial · [1 source] · 2026-06-17 00:00

Researchers have developed a framework and an accompanying model, NarraBERT, for analyzing narrative structure in large language model (LLM) pretraining data. Applied to the 3-trillion-token Dolma corpus, the analysis reveals measurable, multidimensional narrative patterns related to agency, setting, and events. Narrative qualities turn out to be unevenly distributed across data sources and topics, a factor that current data curation practices do not account for. The framework, the dataset (NarraDolma), and the model are being released publicly to advance understanding of how data composition influences narrative reasoning in LLMs.

IMPACT Provides a new method for understanding, and potentially curating, LLM training data by its narrative qualities, a property of the data that could influence model behavior.

RANK_REASON The item describes a research paper detailing a new framework and model for analyzing LLM pretraining data.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 00:00

Characterizing Narrative Content in Web-scale LLM Pretraining Data

A comprehensive analysis of narrative structures in large-scale language model training data reveals measurable, multidimensional narrative patterns that vary across different content sources and topics.

