Researchers have developed a new framework and model, NarraBERT, to analyze narrative structures within large language model (LLM) pretraining data. Applied to the 3-trillion-token Dolma corpus, the analysis reveals measurable, multidimensional narrative patterns related to agency, setting, and events. The findings indicate that narrative qualities are unevenly distributed across data sources and topics, a factor that current data curation practices do not account for. The study's framework, dataset (NarraDolma), and model are being made publicly available to advance the understanding of how data composition influences narrative reasoning in LLMs.
AI IMPACT Provides a new method for understanding and potentially curating LLM training data based on narrative qualities, which could influence model behavior.
RANK_REASON The item describes a research paper detailing a new framework and model for analyzing LLM pretraining data.
Read on Hugging Face Daily Papers →
AI-generated summary · Google Gemini · from 1 source.