Researchers have identified a critical incompatibility between curriculum-based LLM pretraining and standard learning rate decay schedules. Their findings suggest that aggressively decaying the learning rate can negate the benefits of presenting high-quality data in a deliberate order. The study proposes two mitigation strategies: using a more moderate learning rate decay, or replacing decay with model weight averaging. These methods improved benchmark scores by 1.64% over random shuffling for 1.5B-parameter models trained on 30B tokens.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Co-designing data curricula with optimization methods could unlock performance gains in LLM pretraining.
RANK_REASON: Academic paper detailing a novel finding in LLM pretraining methodology.
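The summary does not include the paper's actual recipes, but the two mitigations it names can be illustrated with a minimal PyTorch-style sketch. The function and class names, the `final_frac` parameter, and the uniform-averaging scheme below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two mitigations named in the summary; names,
# defaults, and the uniform-averaging scheme are illustrative assumptions,
# not the paper's implementation.
import copy
import math

import torch


def moderate_cosine_lr(step: int, total_steps: int,
                       peak_lr: float = 3e-4, final_frac: float = 0.5) -> float:
    """Cosine schedule that decays only to final_frac * peak_lr (instead of ~0),
    so the late, high-quality stages of a curriculum still receive sizable updates."""
    min_lr = final_frac * peak_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (peak_lr - min_lr) * cosine


class WeightAverager:
    """Running uniform average of model weights, usable with a constant
    learning rate as an alternative to decaying it."""

    def __init__(self, model: torch.nn.Module):
        self.avg_model = copy.deepcopy(model).eval()
        self.count = 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # Incremental mean: avg += (new - avg) / n
        self.count += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg += (p - p_avg) / self.count
```

In a sketch like this, the averaged weights (rather than the final checkpoint of a fully decayed run) would be the ones evaluated on downstream benchmarks.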