Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

New Research Shows Premature Attention Specialization Hinders Language Model Pretraining

By PulseAugur Editorial Team · [1 source] · 2026-05-11 13:01

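Researchers have identified a pretraining failure mode in language models in which the upper layers specialize too early, before lower-layer features have stabilized. This "premature upper-layer attention specialization" can be mitigated by temporarily slowing the Q/K projections of those upper layers during early training. The intervention improves final perplexity and downstream task accuracy without changing other model parameters, pointing to a key interaction between decoder architecture and optimization.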
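As a rough illustration of what "temporarily slowing the Q/K projections" of upper layers could look like, the sketch below computes a per-layer learning-rate multiplier for query/key weights. This is not the paper's method: the function name `qk_lr_scale`, the linear ramp, and the parameters `slow_steps`, `upper_fraction`, and `min_scale` are all hypothetical choices made for illustration only.

```python
def qk_lr_scale(step, layer_idx, n_layers,
                slow_steps=1000, upper_fraction=0.5, min_scale=0.1):
    """Hypothetical learning-rate multiplier for one layer's Q/K weights.

    Assumptions (not taken from the paper): the top `upper_fraction` of
    layers count as "upper", and their Q/K learning rate ramps linearly
    from `min_scale` back to 1.0 over the first `slow_steps` steps.
    """
    # Index of the first layer treated as "upper" (layers are 0-indexed).
    first_upper = int(n_layers * (1.0 - upper_fraction))

    # Lower layers, and every layer once the slow phase ends, train normally.
    if layer_idx < first_upper or step >= slow_steps:
        return 1.0

    # Upper layers during the slow phase: linear ramp toward full speed.
    progress = step / slow_steps
    return min_scale + (1.0 - min_scale) * progress


def qk_scales_for_step(step, n_layers, **kwargs):
    """Multipliers for all layers at a given training step."""
    return [qk_lr_scale(step, i, n_layers, **kwargs) for i in range(n_layers)]
```

In a training loop, one way to use this would be to place each layer's Q/K weights in their own optimizer parameter group and multiply that group's base learning rate by the returned factor at every step, leaving all other parameters on the unmodified schedule.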

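Impact: Identifies a specific architecture-and-optimization weakness in decoder language models that can be addressed to improve performance.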

Ranking rationale: Academic paper detailing a new finding in language model pretraining.

Read on arXiv cs.CL →

GPT
LLaMA

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Menglin Yang · 2026-05-11 13:01

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer at…
