PulseAugur

Researchers explore weight decay, in-context learning, and acceleration for Transformer models

Several recent papers aim to improve the efficiency and theoretical understanding of Transformer models. One provides a functional-analytic characterization of weight decay, demonstrating its role in shaping loss landscapes and improving generalization. Another investigates how Transformers adapt to different task difficulties during in-context learning, proving optimal convergence rates under distribution shift. Two further papers propose techniques for accelerating Transformer inference: one uses gated subspace inference to reduce memory bandwidth, and the other introduces LEAP, a pretraining objective that enables layer-wise early exits for faster computation.

Impact: These papers offer theoretical insights into Transformer optimization and introduce novel techniques for accelerating inference, potentially leading to more efficient and capable models.

Ranking rationale: The cluster contains multiple academic papers detailing theoretical advancements and new methods for Transformer models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 7 sources. How we write summaries →

Sources [7]

  1. arXiv cs.LG TIER_1 English(EN) · James Hensman ·

    Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Tra…

  2. arXiv cs.LG TIER_1 English(EN) · Abhijit Das, Sayantan Dutta ·

    Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

    arXiv:2605.06599v1 Announce Type: new Abstract: Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic chara…

  3. arXiv cs.LG TIER_1 English(EN) · Tianyi Ma, Tengyao Wang, Richard J. Samworth ·

    Optimal In-context Adaptivity and Distributional Robustness of Transformers

    arXiv:2510.23254v3 Announce Type: replace-cross Abstract: We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}} \lambda_{\alpha} \pi_{\alpha}$, called the pretraining prior, in which eac…

  4. arXiv cs.LG TIER_1 English(EN) · Sayantan Dutta ·

    Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

    Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objectiv…

  5. arXiv cs.LG TIER_1 English(EN) · Stephen J. Thomas ·

    Gated Subspace Inference for Transformer Acceleration

    arXiv:2605.03109v1 Announce Type: new Abstract: A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace compon…

  6. arXiv cs.CL TIER_1 English(EN) · Shashank Kapadia, Deep Naryan Mishra, Sujal Reddy Alugubelli, Haoan Wang, Saipraveen Vabbilisetty, Rishi Bhatia, Anupriya Sharma ·

    LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

    arXiv:2605.01058v1 Announce Type: cross Abstract: Layer-aligned distillation and convergence-based early exit represent two predominant computational efficiency paradigms for transformer inference; yet we establish that they exhibit systematic incompatibility under standard deplo…

  7. arXiv stat.ML TIER_1 English(EN) · Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman ·

    Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    arXiv:2605.07588v1 Announce Type: cross Abstract: Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energ…
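
The summary above mentions layer-wise early exits as a route to faster inference. As a rough illustration of the general idea only (not LEAP's pretraining objective, whose details are not given in the excerpts here), the sketch below stops applying layers once consecutive hidden states stop changing much. The toy layers, the cosine-similarity test, and the `threshold` value are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_with_early_exit(hidden, layers, threshold=0.999):
    """Apply layers in order and exit once consecutive hidden states converge.

    Returns (final_hidden, number_of_layers_used). The convergence test is a
    hypothetical stand-in for whatever exit criterion a real system would use.
    """
    for depth, layer in enumerate(layers, start=1):
        new_hidden = layer(hidden)
        converged = cosine_similarity(hidden, new_hidden) >= threshold
        hidden = new_hidden
        if converged:
            return hidden, depth
    return hidden, len(layers)


def make_layer(target, rate=0.5):
    """Toy 'layer' that moves the hidden state part of the way toward a fixed target."""
    def layer(h):
        return [x + rate * (t - x) for x, t in zip(h, target)]
    return layer


target = [1.0, 2.0, 3.0]
layers = [make_layer(target) for _ in range(12)]
out, used = run_with_early_exit([3.0, -1.0, 0.5], layers)
print(f"exited after {used} of {len(layers)} layers")
```

In this toy setup the hidden state settles after only a few layers, so most of the stack is skipped. A real model would need an exit criterion tuned to keep output quality acceptable, which is the kind of trade-off the papers listed above study.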