PulseAugur

Data repetition significantly harms language model performance, research finds

A new research paper on arXiv examines how data repetition harms language models in the era of Chinchilla-style scaling laws. The study quantifies the 'Compute-Equivalent Gain' and 'Compute-Equivalent Loss' associated with repetition and finds that the damage peaks at an intermediate repeat count. This most damaging repeat count scales with model size: as models grow, it increases faster than compute. The research shows that even a 10% budget for repeated documents can degrade performance significantly; for a 344M-parameter model, the result is equivalent to training with 67% less compute and no repetition. The findings are supported by a statistical model of misspecified linear regression with verbatim duplicates, which highlights a tradeoff between memorization and generalization.
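To make the "67% less compute" figure concrete, here is a minimal sketch of how a compute-equivalent loss could be computed. It assumes a simple power-law fit of loss against compute for repetition-free training, L(C) = E + A * C^(-alpha), and reads "compute-equivalent loss" as the fraction of compute that a repetition-free run could have saved while reaching the same loss. The function names, the fit form, and every constant are hypothetical illustrations, not the paper's definitions or measurements.

```python
# Hypothetical sketch: names, fit form, and constants are assumptions,
# not taken from the paper.

def loss_from_compute(compute, irreducible, scale, exponent):
    """Assumed no-repetition scaling law: L(C) = E + A * C**(-alpha)."""
    return irreducible + scale * compute ** (-exponent)


def compute_from_loss(loss, irreducible, scale, exponent):
    """Invert the assumed scaling law to get the compute that reaches `loss`."""
    if loss <= irreducible:
        raise ValueError("loss must exceed the irreducible term of the fit")
    return (scale / (loss - irreducible)) ** (1.0 / exponent)


def compute_equivalent_loss(actual_compute, observed_loss,
                            irreducible, scale, exponent):
    """Fraction of compute effectively wasted by a run with repetition.

    A value of 0.67 means a repetition-free run would have matched the
    observed loss with 67% less compute.
    """
    equivalent = compute_from_loss(observed_loss, irreducible, scale, exponent)
    return 1.0 - equivalent / actual_compute


# Made-up fit parameters and budget, purely for illustration.
FIT = dict(irreducible=1.7, scale=400.0, exponent=0.15)
BUDGET = 1e18  # FLOPs spent on the run that contains repeated documents

# Suppose the run with repetition ends at the loss a clean run would
# reach with only 33% of the budget.
observed = loss_from_compute(0.33 * BUDGET, **FIT)
wasted = compute_equivalent_loss(BUDGET, observed, **FIT)
print(f"compute-equivalent loss: {wasted:.0%}")
```

Under this reading, the metric depends only on where the observed loss lands on the repetition-free curve, so it can be compared across model sizes once a curve has been fitted for each.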

IMPACT Quantifies wasted compute in LLM training due to data repetition, guiding better data curation practices.

RANK_REASON Academic paper detailing research findings on language model training data.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Jessica Chudnovsky, Joshua Kazdan, Noam Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

    Internal Data Repetition Destroys Language Models

    arXiv:2606.24998v1. Abstract: Language models are running out of high-quality training data, and even aggressively deduplicated corpora retain some amount of repetition. Earlier controlled studies predated Chinchilla-style scaling laws and could only measure the…