Taming Curvature: Architecture Warm-Up for Stable Transformer Training
Researchers have developed a new method to stabilize the training of large Transformer models, which are often prone to instability and divergence. The approach, called "architecture warm-up," progressively increases network depth to keep the preconditioned Hessian in check; this measure of curvature correlates with training instabilities. The technique, supported by a fast online estimator for Hessian eigenvalues, has been shown to reduce instabilities without hindering convergence.
AI IMPACT: Improves the efficiency and reliability of training large-scale Transformer models.
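
To give a rough sense of what estimating a Hessian eigenvalue involves, here is a minimal sketch of power iteration on Hessian-vector products. It is not the researchers' estimator: it uses the raw (not preconditioned) Hessian, approximates Hessian-vector products with finite differences of the gradient, and runs on a hypothetical two-parameter toy loss. The names `toy_grad`, `hvp`, and `top_eigenvalue` are illustrative assumptions.

```python
import math
import random


def toy_grad(w):
    # Gradient of the hypothetical toy loss f(w) = 0.5 * (4*w0^2 + w1^2).
    # Its Hessian is diag(4, 1), so the largest eigenvalue is 4.
    return [4.0 * w[0], 1.0 * w[1]]


def hvp(grad_fn, w, v, eps=1e-4):
    # Approximate the Hessian-vector product H v with a central
    # finite difference of the gradient: (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    # Power iteration: repeatedly apply H to a unit vector and renormalize.
    # The Rayleigh quotient v^T H v approaches the eigenvalue of largest magnitude.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam


if __name__ == "__main__":
    print(top_eigenvalue(toy_grad, [0.3, -0.7]))
```

In a real training run the gradient function would come from automatic differentiation, and a monitor of this kind could be checked as depth is increased; how the actual method couples the estimate to the depth schedule is not described here.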