Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Two recent papers address hyperparameter transfer, the practice of tuning hyperparameters on small models and reusing them at larger scale. One introduces a parameterization justified by dynamical mean-field theory and reports reliable hyperparameter transfer across models ranging from 51 million to over 2 billion parameters. The other quantifies hyperparameter transfer and highlights the role of the embedding layer's learning rate, suggesting that maximizing it can significantly improve training stability and performance, particularly with the AdamW optimizer.
AI IMPACT New parameterization and optimization techniques could significantly reduce the cost and complexity of training large-scale AI models.
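To make the embedding learning rate point concrete, here is a minimal sketch of giving the embedding layer its own learning rate through optimizer parameter groups. It is not taken from either paper: the helper `build_param_groups`, the name-matching rule `"embed"`, and the learning rate values are all hypothetical choices for illustration. The output is a list of dicts with `params` and `lr` keys, which is the parameter-group format that `torch.optim.AdamW` accepts.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float,
    embedding_lr: float,
    embedding_key: str = "embed",  # hypothetical naming convention
) -> List[Dict[str, Any]]:
    """Split parameters into two groups so the embedding layer can use
    a different learning rate from the rest of the network."""
    embedding_params, other_params = [], []
    for name, param in named_params:
        # Route a parameter to the embedding group if its name matches.
        if embedding_key in name:
            embedding_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": other_params, "lr": base_lr},
        {"params": embedding_params, "lr": embedding_lr},
    ]


# Stand-in for model.named_parameters(); strings replace real tensors.
toy_params = [
    ("embed.weight", "E"),
    ("block0.attn.weight", "W_attn"),
    ("block0.mlp.weight", "W_mlp"),
]

# Illustrative values only; neither paper is the source of these numbers.
groups = build_param_groups(toy_params, base_lr=3e-4, embedding_lr=3e-3)
```

With a real PyTorch model, the same list could be built from `model.named_parameters()` and passed as the first argument to `torch.optim.AdamW`, assuming the embedding parameters can be identified by name.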