Researchers have identified a critical incompatibility between curriculum-based LLM pretraining and standard learning rate decay schedules. Their findings suggest that aggressively decaying the learning rate can negate the benefits of presenting high-quality data in a deliberate order. The study proposes two mitigation strategies: using a more moderate learning rate decay, or replacing decay with model weight averaging. These methods improved benchmark scores by 1.64% over random shuffling for 1.5B-parameter models trained on 30B tokens.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Co-designing data curricula with optimization methods could unlock performance gains in LLM pretraining.
RANK_REASON: Academic paper detailing a novel finding in LLM pretraining methodology.
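The summary does not include the paper's actual recipes, but the two mitigations it names can be illustrated with a minimal PyTorch-style sketch. The function and class names, the `final_frac` parameter, and the uniform-averaging scheme below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two mitigations named in the summary; names,
# defaults, and the uniform-averaging scheme are illustrative assumptions,
# not the paper's implementation.
import copy
import math

import torch


def moderate_cosine_lr(step: int, total_steps: int,
                       peak_lr: float = 3e-4, final_frac: float = 0.5) -> float:
    """Cosine schedule that decays only to final_frac * peak_lr (instead of ~0),
    so the late, high-quality stages of a curriculum still receive sizable updates."""
    min_lr = final_frac * peak_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (peak_lr - min_lr) * cosine


class WeightAverager:
    """Running uniform average of model weights, usable with a constant
    learning rate as an alternative to decaying it."""

    def __init__(self, model: torch.nn.Module):
        self.avg_model = copy.deepcopy(model).eval()
        self.count = 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # Incremental mean: avg += (new - avg) / n
        self.count += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg += (p - p_avg) / self.count
```

In a sketch like this, the averaged weights (rather than the final checkpoint of a fully decayed run) would be the ones evaluated on downstream benchmarks.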