PulseAugur

Learning Rate Transfer in Normalized Transformers

Researchers have developed a new parameterization for the Normalized Transformer, termed νGPT, which addresses the issue of learning rate transfer. Unlike the original nGPT, which struggles to maintain optimal learning rates across different model dimensions and token horizons, νGPT demonstrates effective learning rate transfer. This advancement was achieved by combining experimental data with theoretical insights on alignment exponents and by modifying the μP approach to hyperparameter transfer.
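For context on what the μP approach means in practice, the sketch below shows the standard μP learning-rate rule for Adam: learning rates for hidden weight matrices are scaled inversely with width, so a rate tuned on a small model carries over to a wider one. Per the summary, νGPT's contribution is a modification of this kind of rule, informed by alignment exponents, so that transfer holds for normalized transformers. This is a simplified baseline sketch, not νGPT's own prescription; the helper mup_param_groups, the base width of 256, and the toy model are illustrative assumptions.

    # Minimal sketch of muP-style learning-rate transfer for Adam
    # (simplified: embedding/output layers need their own treatment in full muP).
    import torch

    def mup_param_groups(model, base_lr, base_width, width):
        # Hidden weight matrices get lr * base_width / width;
        # vector-like parameters (biases, norm scales) keep the base rate.
        matrices = [p for p in model.parameters() if p.ndim >= 2]
        vectors = [p for p in model.parameters() if p.ndim < 2]
        return [
            {"params": matrices, "lr": base_lr * base_width / width},
            {"params": vectors, "lr": base_lr},
        ]

    # base_lr is assumed tuned once at width 256, then reused at width 1024.
    width = 1024
    model = torch.nn.Sequential(
        torch.nn.Linear(width, 4 * width),
        torch.nn.GELU(),
        torch.nn.Linear(4 * width, width),
    )
    optimizer = torch.optim.Adam(
        mup_param_groups(model, base_lr=3e-3, base_width=256, width=width)
    )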

Summary written from 2 sources.

IMPACT Improves hyperparameter transferability in Transformers, potentially streamlining training for larger models.

RANK_REASON Academic paper introducing a novel model parameterization and demonstrating its improved performance.

Read on arXiv stat.ML →

COVERAGE [2]

  1. arXiv cs.AI TIER_1 · Boris Shigida, Boris Hanin, Andrey Gromov

    Learning Rate Transfer in Normalized Transformers

    arXiv:2604.27077v1 Announce Type: cross. Abstract: The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size,…

  2. arXiv stat.ML TIER_1 · Andrey Gromov

    Learning Rate Transfer in Normalized Transformers

    The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning ra…
