New theory explains Transformer generalization delay via Bayesian inference

By the PulseAugur editorial team · [1 source] · 2026-05-15 09:46

A newly proposed theory explains why Transformer models generalize only long after they have memorized their training data, the delay known as grokking. The theory frames attention as an implicit Bayesian posterior over task dependency graphs and argues that generalization requires both suitable MLP capacity and a new Bayesian structural condition: attention must assign sufficient weight to every informative token. When that condition fails, structural inference is delayed, and specific interventions can bypass the delay.
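To make the structural condition more concrete, here is a minimal Python sketch of one way to read "attention must assign sufficient weight to all informative tokens". It is an illustration only, not the paper's formulation: the threshold `eps`, the `informative` index set, and the function name `covers_informative_tokens` are all hypothetical, since this summary does not give the exact condition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def covers_informative_tokens(scores, informative, eps):
    """Return True if every informative position gets attention weight >= eps.

    `informative` (indices of tokens assumed to matter for the task) and
    `eps` (the minimum weight) are hypothetical parameters chosen for
    illustration; the source summary does not specify them.
    """
    weights = softmax(scores)
    return all(weights[i] >= eps for i in informative)


# Toy example: 4 tokens, positions 0 and 2 are assumed to be informative.
peaked = [5.0, 0.0, 0.0, 0.0]  # attention collapses onto token 0
spread = [2.0, 0.0, 2.0, 0.0]  # attention shared by tokens 0 and 2

peaked_ok = covers_informative_tokens(peaked, informative=[0, 2], eps=0.1)
spread_ok = covers_informative_tokens(spread, informative=[0, 2], eps=0.1)
print(peaked_ok, spread_ok)
```

In this toy setup the peaked pattern leaves token 2 with almost no weight, so the check fails, while the spread pattern passes. Under the summary's account, the first situation is the kind of failure that would delay structural inference.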

Impact: Proposes a new theoretical framework for understanding Transformer model behavior, potentially guiding future architectural improvements.

Ranking rationale: The cluster contains an academic paper detailing a new theoretical model for Transformer generalization.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Joseph An · 2026-05-15 09:46

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the tr…

