PulseAugur

Paper details uniform scaling limits in AdamW-trained transformers

Researchers have published a paper on uniform scaling limits in transformers trained with the AdamW optimizer. The study models the hidden-state dynamics as an interacting particle system coupled through the attention mechanism and shows that, in the large-depth limit, these dynamics converge to a forward-backward system of ODEs. The convergence rate depends on the transformer's depth and number of heads, and the derived bounds are independent of token count and embedding dimension.
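To make the particle-system picture concrete, here is a minimal sketch of the forward pass only. It is not the paper's construction: the `1/depth` residual step, the averaging over heads, the weights shared across layers, and every name in the code (`attention_drift`, `forward`, `max_gap`) are assumptions made for illustration. Training with AdamW and the backward half of the limiting system are left out entirely. Tokens are treated as particles that each move along an attention-weighted average of the others, and the script checks that doubling the depth changes the output less and less, as one would expect if the stack approaches an ODE.

```python
import math
import random


def matvec(m, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention_drift(xs, heads):
    """Velocity of each token: softmax-attention average of values, averaged over heads."""
    n, d = len(xs), len(xs[0])
    drift = [[0.0] * d for _ in range(n)]
    for wq, wk, wv in heads:
        qs = [matvec(wq, x) for x in xs]
        ks = [matvec(wk, x) for x in xs]
        vs = [matvec(wv, x) for x in xs]
        for i in range(n):
            scores = [sum(a * b for a, b in zip(qs[i], k)) for k in ks]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            for j in range(n):
                for c in range(d):
                    drift[i][c] += weights[j] / total * vs[j][c] / len(heads)
    return drift


def forward(xs, heads, depth):
    """Residual stack with step 1/depth (assumed scaling); weights shared across layers."""
    xs = [list(x) for x in xs]
    for _ in range(depth):
        drift = attention_drift(xs, heads)
        xs = [[a + b / depth for a, b in zip(x, dx)] for x, dx in zip(xs, drift)]
    return xs


def max_gap(a, b):
    """Largest coordinate-wise difference between two token configurations."""
    return max(abs(p - q) for x, y in zip(a, b) for p, q in zip(x, y))


# Hypothetical toy sizes and random weights, chosen only for the demonstration.
rng = random.Random(0)
dim, n_tokens, n_heads = 3, 5, 2
heads = [
    tuple([[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
          for _ in range(3))  # one (W_q, W_k, W_v) triple per head
    for _ in range(n_heads)
]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]

gap_shallow = max_gap(forward(tokens, heads, 8), forward(tokens, heads, 16))
gap_deep = max_gap(forward(tokens, heads, 64), forward(tokens, heads, 128))
print("gap between depth 8 and 16:   ", gap_shallow)
print("gap between depth 64 and 128: ", gap_deep)
```

The shrinking gap in this toy only reflects ordinary Euler-scheme convergence for fixed weights. The paper's claims concern the joint dynamics under AdamW training and bounds that hold uniformly in token count and embedding dimension, which this sketch does not attempt to reproduce.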

IMPACT Provides theoretical insights into transformer scaling, potentially informing future model design and training strategies.

RANK_REASON Academic paper published on arXiv detailing theoretical findings about transformer model scaling.


AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv stat.ML TIER_1 English(EN) · William Gibson, Christoph Reisinger ·

    Uniform Scaling Limits in AdamW-Trained Transformers

    arXiv:2605.11059v1 · Abstract: We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention he…

  2. arXiv stat.ML TIER_1 English(EN) · Christoph Reisinger ·

    Uniform Scaling Limits in AdamW-Trained Transformers

    We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hid…