A new research paper examines how weight-scale parameters in transformer models evolve during AdamW training. The study derives a three-force decomposition of the squared weight norm into alignment, injection, and decay forces. Analysis of Pythia-70M models indicates that the alignment force dominates during the weight-scale growth phase, while near saturation the alignment and decay forces balance, leading to relaxation. The researchers also developed a spline displacement method to accurately recover the alignment force from sparse checkpoints.
AI IMPACT Provides a deeper understanding of transformer training dynamics, potentially leading to more efficient model optimization techniques.
RANK_REASON The cluster contains a research paper detailing novel analysis of transformer training dynamics.
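To make the decomposition concrete, here is a minimal sketch of how a single AdamW step changes the squared weight norm. It assumes the standard decoupled update `w_next = w - lr * (u + wd * w)`, where `u` stands in for Adam's normalized update direction. Mapping the three algebraic terms to the names "alignment", "decay", and "injection" is an assumption made for illustration; the paper's exact definitions may differ. The function name `adamw_norm_forces` and all numeric values are hypothetical.

```python
import numpy as np

def adamw_norm_forces(w, u, lr, wd):
    """Split the one-step change in ||w||^2 under w_next = w - lr * (u + wd * w).

    The term names are an illustrative assumption, not the paper's definitions.
    """
    step = u + wd * w
    alignment = -2.0 * lr * float(w @ u)      # first-order term: overlap of weights and update
    decay = -2.0 * lr * wd * float(w @ w)     # first-order term: weight-decay shrinkage
    injection = lr**2 * float(step @ step)    # second-order term: norm added by the step itself
    return alignment, decay, injection

rng = np.random.default_rng(0)
w = rng.normal(size=1000)   # stand-in weight vector
u = rng.normal(size=1000)   # stand-in for Adam's normalized update direction
lr, wd = 1e-3, 0.1          # hypothetical learning rate and weight decay

w_next = w - lr * (u + wd * w)
alignment, decay, injection = adamw_norm_forces(w, u, lr, wd)
delta = float(w_next @ w_next - w @ w)

# The three terms account for the full change in the squared norm.
assert np.isclose(delta, alignment + decay + injection)
```

In this sketch the decay term is always negative and the injection term always positive, so only the alignment term can change sign. That is consistent with the summary's description of alignment driving growth and then balancing against decay near saturation.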
- AdamW
- alphaXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- Hugging Face
- IArxiv
- Influence Flower
- Pythia 70M
- ScienceCast
- Weibull
AI-generated summary · Google Gemini · from 1 source.