Researchers have published a paper detailing uniform scaling limits for transformers trained with the AdamW optimizer. The study models the hidden-state dynamics as an interacting particle system and proves convergence to a forward-backward system of ODEs. The convergence rate depends on the transformer's depth and number of heads, and the derived mathematical bounds are independent of token count and embedding dimension.
Summary written by gemini-2.5-flash-lite from 2 sources.
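For intuition, the interacting-particle view treats each token embedding as a particle whose position evolves with depth under self-attention. The display below is a generic sketch of that formulation, not the paper's exact forward-backward system; the matrices $Q$, $K$, $V$ and the notation are assumed for illustration:

\[
\dot{x}_i(t) \;=\; \sum_{j=1}^{n}
  \frac{\exp\!\big(\langle Q x_i(t),\, K x_j(t)\rangle\big)}
       {\sum_{k=1}^{n} \exp\!\big(\langle Q x_i(t),\, K x_k(t)\rangle\big)}
  \, V x_j(t),
\qquad i = 1, \dots, n,
\]

where $x_1(t), \dots, x_n(t)$ are the $n$ token embeddings at depth $t$ and $Q$, $K$, $V$ are query, key, and value matrices. The paper's forward-backward ODE system presumably couples forward dynamics of this kind with a backward pass capturing gradients under AdamW training; that coupled system is not reproduced here.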
IMPACT Provides theoretical insights into transformer scaling, potentially informing future model design and training strategies.
RANK_REASON Academic paper published on arXiv detailing theoretical findings about transformer model scaling.