New research highlights a frequency bias in Stochastic Gradient Descent (SGD) when training language models on imbalanced token distributions. Because a token's parameters receive gradient signal roughly in proportion to how often that token appears, parameters for common tokens converge quickly, while those for rare but important tokens may not receive sufficient updates. The Adam optimizer, which scales each parameter's step using running statistics of its past gradients, compensates for this imbalance. In a controlled experiment with a six-token vocabulary, Adam's variance normalization allowed rare-token parameters to learn faster than under standard SGD.
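The effect can be illustrated with a minimal sketch. This is not the experiment from the research; it is a toy setup under stated assumptions: six hypothetical token frequencies, one scalar parameter per token with a target value of 1.0, and the expected (frequency-weighted) gradient used in place of sampled minibatches. The learning rates and the names `FREQS`, `run_sgd`, and `run_adam` are illustrative choices, not values from the source.

```python
import math

# Hypothetical six-token unigram frequencies (illustrative, not from the source).
FREQS = [0.50, 0.25, 0.15, 0.07, 0.02, 0.01]
TARGET = 1.0   # assumed optimum for every per-token parameter
STEPS = 200


def expected_grads(weights):
    # Token i contributes loss 0.5 * (w_i - TARGET)^2 only when it appears,
    # so its expected gradient is scaled by its frequency p_i.
    return [p * (w - TARGET) for p, w in zip(FREQS, weights)]


def run_sgd(lr=0.1):
    w = [0.0] * len(FREQS)
    for _ in range(STEPS):
        g = expected_grads(w)
        # Plain gradient step: the step size is proportional to frequency.
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def run_adam(lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    n = len(FREQS)
    w, m, v = [0.0] * n, [0.0] * n, [0.0] * n
    for t in range(1, STEPS + 1):
        g = expected_grads(w)
        for i in range(n):
            # Running estimates of the gradient and squared gradient.
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias-corrected estimates, as in the standard Adam update.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Dividing by sqrt(v_hat) cancels the frequency scaling of g.
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def errors(weights):
    return [abs(TARGET - w) for w in weights]


sgd_err = errors(run_sgd())
adam_err = errors(run_adam())

if __name__ == "__main__":
    for p, s, a in zip(FREQS, sgd_err, adam_err):
        print(f"freq={p:.2f}  sgd_error={s:.4f}  adam_error={a:.4f}")
```

In this toy setting, the SGD error for each token shrinks by a factor of `1 - lr * p` per step, so the rarest token stays far from its target after the common tokens have converged. Adam's per-parameter normalization makes its update nearly independent of the gradient's scale, so all six parameters follow almost the same trajectory. The two optimizers use different learning rates here, so the comparison that matters is across tokens within each optimizer, not the absolute errors between them.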
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT Explains how Adam's adaptive learning rates mitigate SGD's frequency bias, potentially improving rare-token representations in LLMs.