Paper details uniform scaling limits in AdamW-trained transformers

By PulseAugur Editorial · [2 sources] · 2026-05-11 16:54

Researchers have published a paper on uniform scaling limits in transformers trained with the AdamW optimizer. The study models the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism and, in the large-depth limit, shows convergence to a forward-backward system of ODEs. The convergence rate depends on the transformer's depth and number of heads, and the derived bounds are independent of token count and embedding dimension.
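To make the particle-system picture concrete, here is a minimal NumPy sketch of the forward pass only. It is an illustration under stated assumptions, not the paper's construction: each token is treated as a particle, each residual attention layer as one Euler step of size 1/L, and the heads are simply averaged. The 1/L step, the head averaging, and the names `attention_drift` and `forward` are hypothetical choices made for this example; the paper's exact head scaling and the AdamW-coupled backward dynamics are not reproduced.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_drift(x, heads):
    """Interaction term felt by each token (particle).

    x:     array of shape (n_tokens, d), one row per particle.
    heads: list of (Q, K, V) weight triples, each of shape (d, d).
    Returns the average over heads of the self-attention outputs
    (averaging is an assumption of this sketch).
    """
    d = x.shape[1]
    out = np.zeros_like(x)
    for Q, K, V in heads:
        scores = (x @ Q) @ (x @ K).T / np.sqrt(d)   # pairwise interactions
        out += softmax(scores, axis=1) @ (x @ V)    # attention-weighted mean
    return out / len(heads)


def forward(x0, layers):
    """Residual stack read as an Euler scheme with step 1/L (assumed):
    x_{l+1} = x_l + (1/L) * attention_drift(x_l, heads_l).
    """
    L = len(layers)
    x = x0.copy()
    for heads in layers:
        x = x + attention_drift(x, heads) / L
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, d, n_heads, depth = 6, 4, 2, 16
    x0 = rng.normal(size=(n_tokens, d))
    layers = [
        [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
         for _ in range(n_heads)]
        for _ in range(depth)
    ]
    print(forward(x0, layers).shape)
```

In this toy setting, with no positional encoding, permuting the input tokens permutes the outputs in the same way, which is the exchangeability that motivates describing the tokens as interacting particles.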

IMPACT Provides theoretical insights into transformer scaling, potentially informing future model design and training strategies.

RANK_REASON Academic paper published on arXiv detailing theoretical findings about transformer model scaling.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv stat.ML TIER_1 English(EN) · William Gibson, Christoph Reisinger · 2026-05-13 04:00

Uniform Scaling Limits in AdamW-Trained Transformers

arXiv:2605.11059v1 (announce type: new). Abstract: We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention he…

arXiv stat.ML TIER_1 English(EN) · Christoph Reisinger · 2026-05-11 16:54

Uniform Scaling Limits in AdamW-Trained Transformers

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hid…
