A new research paper published on arXiv examines the detrimental effects of data repetition in language model training, particularly in the era of Chinchilla-style scaling laws. The study quantifies the 'Compute-Equivalent Gain' and 'Compute-Equivalent Loss' associated with repetition and finds that the damage peaks at an intermediate repeat count. This most damaging repeat count scales with model size: as models grow, it increases faster than compute. The research shows that even a 10% budget for repeated documents can significantly degrade performance; for a 344M-parameter model, the result is equivalent to using 67% less compute in a no-repetition scenario. The findings are supported by a statistical model of misspecified linear regression with verbatim duplicates, which highlights a tradeoff between memorization and generalization.
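To make the "equivalent to using 67% less compute" figure concrete, here is a minimal sketch of one plausible way to read a compute-equivalent loss: fit a loss-versus-compute curve on no-repetition runs, then ask how much baseline compute would reach the loss observed with repetition. The power-law form, the coefficients `a`, `b`, `c`, and all function names are illustrative assumptions, not values or definitions taken from the paper.

```python
# Hypothetical sketch: the curve shape and coefficients are assumptions,
# not the paper's fitted values or its exact definition of the metric.

def baseline_loss(compute, a=10.0, b=0.1, c=1.5):
    """Assumed no-repetition loss curve: L(C) = a * C**(-b) + c."""
    return a * compute ** (-b) + c


def equivalent_compute(loss, a=10.0, b=0.1, c=1.5):
    """Invert the assumed baseline curve: compute at which it reaches `loss`."""
    if loss <= c:
        raise ValueError("loss is at or below the irreducible floor c")
    return ((loss - c) / a) ** (-1.0 / b)


def compute_equivalent_loss(actual_compute, observed_loss, **curve):
    """Fraction of the spent compute that is effectively wasted."""
    return 1.0 - equivalent_compute(observed_loss, **curve) / actual_compute


# A run that spent 1e18 FLOPs but only matches the baseline at 0.33e18 FLOPs
# has effectively lost about 67% of its compute.
spent = 1e18
observed = baseline_loss(0.33 * spent)
print(round(compute_equivalent_loss(spent, observed), 2))
```

Under this reading, a compute-equivalent gain would simply be the negative case, where the observed loss corresponds to more baseline compute than was actually spent.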
IMPACT Quantifies wasted compute in LLM training due to data repetition, guiding better data curation practices.