When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

Study: removing LayerNorm from LLMs can act as an implicit regularizer whose effect depends on training data size.

By the PulseAugur editorial team · [1 source] · 2026-04-28 04:00

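Researchers investigated the effect of removing layer normalization (LayerNorm) from neural network architectures, specifically in models such as GPT-2 and Llama. Their findings indicate that replacing LayerNorm with Dynamic Tanh (DyT), a learned activation-bounding mechanism, acts as a regime-dependent implicit regularizer. DyT can therefore improve performance in some training regimes (for example, smaller datasets) and degrade it in others (for example, larger datasets or increased model capacity). The study identifies activation saturation as a key factor in DyT's performance, with the saturation level varying with model size and training data.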
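To make "activation bounding" and "saturation" concrete, here is a minimal Python sketch of an elementwise DyT-style transform and a simple saturation measure. The `gamma` and `beta` parameters follow the original DyT formulation rather than this summary, and the helper `saturation_fraction` with its 0.99 threshold is a hypothetical illustration, not the metric used in the paper.

```python
import math


def dyt(xs, alpha, gamma=1.0, beta=0.0):
    """Elementwise DyT-style bounding: gamma * tanh(alpha * x) + beta.

    alpha is the learned scale from the summary; gamma and beta are
    assumed affine parameters (not stated in the summary).
    """
    return [gamma * math.tanh(alpha * x) + beta for x in xs]


def saturation_fraction(xs, alpha, threshold=0.99):
    """Share of inputs whose |tanh(alpha * x)| exceeds `threshold`.

    Hypothetical helper: the threshold is an illustrative choice.
    """
    if not xs:
        return 0.0
    saturated = sum(1 for x in xs if abs(math.tanh(alpha * x)) > threshold)
    return saturated / len(xs)


# Toy activations, chosen only for illustration.
activations = [-6.0, -1.0, -0.2, 0.0, 0.3, 1.5, 8.0]

bounded = dyt(activations, alpha=0.5)
small_alpha = saturation_fraction(activations, alpha=0.5)
large_alpha = saturation_fraction(activations, alpha=2.0)
```

With the same inputs, a larger `alpha` pushes more activations into the flat part of tanh, which is the sense in which the saturation level can differ between settings.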

Impact: Offers a more nuanced view of regularization, showing that architectural choices such as replacing LayerNorm have regime-dependent effects.

Ranking rationale: Academic paper detailing a new regularization technique for neural networks.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL TIER_1 · Lucky Verma · 2026-04-28 04:00

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

arXiv:2604.23434v1 Announce Type: cross Abstract: Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models s…
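As a reading aid for the abstract's "learned tanh(alpha x)", the transform and its gradient can be written as follows. The affine parameters $\gamma$ and $\beta$ are an assumption taken from the original DyT formulation; the truncated abstract mentions only $\alpha$.

```latex
\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta,
\qquad
\frac{\partial}{\partial x}\tanh(\alpha x) = \alpha\left(1 - \tanh^{2}(\alpha x)\right)
```

Before the affine step the output lies in $(-1, 1)$, and the gradient tends to zero as $|\alpha x|$ grows, which is what saturation means here.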
