LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

Researchers explore weight decay, in-context learning, and acceleration methods for Transformer models

By the PulseAugur editorial team · [7 sources] · 2026-05-05 04:00

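Researchers have developed several new methods that improve both the efficiency and the theoretical understanding of Transformer models. One paper gives a functional-analytic characterization of weight decay, showing how it shapes the loss landscape and improves generalization. Another study examines how Transformers adapt to varying task difficulty during in-context learning and proves optimal convergence rates under distribution shift. In addition, two papers propose techniques for accelerating Transformer inference: one uses gated subspace inference to reduce memory bandwidth, and the other introduces LEAP, a pretraining objective that supports layer-wise early exit for faster computation.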

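Impact: These papers offer theoretical insight into Transformer optimization and introduce new techniques for accelerating inference, which could lead to more efficient and more capable models.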

Ranking rationale: This cluster contains several academic papers detailing theoretical advances and new methods for Transformer models.


AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

arXiv cs.LG TIER_1 English(EN) · James Hensman · 2026-05-08 11:02

Revisiting Transformer Layer Parameterization via Causal Energy Minimization

Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Tra…
arXiv cs.LG TIER_1 English(EN) · Abhijit Das, Sayantan Dutta · 2026-05-08 04:00

Weight Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

arXiv:2605.06599v1 Announce Type: new Abstract: Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic chara…
arXiv cs.LG TIER_1 English(EN) · Tianyi Ma, Tengyao Wang, Richard J. Samworth · 2026-05-08 04:00

Optimal In-Context Adaptivity and Distributional Robustness of Transformers

arXiv:2510.23254v3 Announce Type: replace-cross Abstract: We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}} \lambda_{\alpha} \pi_{\alpha}$, called the pretraining prior, in which eac…
arXiv cs.LG TIER_1 English(EN) · Sayantan Dutta · 2026-05-07 17:22

Weight Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objectiv…
arXiv cs.LG TIER_1 English(EN) · Stephen J. Thomas · 2026-05-06 04:00

Gated Subspace Inference for Transformer Acceleration

arXiv:2605.03109v1 Announce Type: new Abstract: A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace compon…
arXiv cs.CL TIER_1 English(EN) · Shashank Kapadia, Deep Naryan Mishra, Sujal Reddy Alugubelli, Haoan Wang, Saipraveen Vabbilisetty, Rishi Bhatia, Anupriya Sharma · 2026-05-05 04:00

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

arXiv:2605.01058v1 Announce Type: cross Abstract: Layer-aligned distillation and convergence-based early exit represent two predominant computational efficiency paradigms for transformer inference; yet we establish that they exhibit systematic incompatibility under standard deplo…
arXiv stat.ML TIER_1 English(EN) · Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman · 2026-05-11 04:00

Revisiting Transformer Layer Parameterization via Causal Energy Minimization

arXiv:2605.07588v1 Announce Type: cross Abstract: Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energ…

