PulseAugur

New optimizers respect neural network symmetries, improve training

Researchers have introduced a principle for designing deep learning optimizers that respects the inherent symmetries of neural network architectures. Widely used optimizers such as Adam update parameters coordinate-wise; the proposed symmetry-compatible optimizers are instead designed to be equivariant to the symmetry group of each weight block. The principle has been applied to embeddings, LM heads, MLPs, and MoE routers, yielding novel update rules for each. Experiments on language models show that these optimizers consistently improve validation loss and training stability compared to standard AdamW.
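To make the equivariance idea concrete, here is a minimal sketch in plain Python. It is not the paper's update rule. It assumes a hypothetical symmetry (a 2D rotation Q acting on the left of a gradient matrix G) and compares two stand-in updates: a sign update, which is what Adam's first step reduces to as its epsilon goes to zero, and a Frobenius-normalized gradient, chosen here only because it commutes with rotations. All function names are illustrative.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rotation(theta):
    # 2x2 rotation matrix; stands in for a hypothetical symmetry of a weight block.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def sign_update(g):
    # Coordinate-wise direction: sign of each gradient entry.
    return [[math.copysign(1.0, x) if x != 0 else 0.0 for x in row] for row in g]

def normalized_update(g):
    # Gradient divided by its Frobenius norm (an illustrative stand-in,
    # not one of the update rules from the paper).
    norm = math.sqrt(sum(x * x for row in g for x in row))
    return [[x / norm for x in row] for row in g]

def equivariance_gap(update, g, q):
    # Largest entry of |update(Q G) - Q update(G)|; zero means equivariant.
    lhs = update(matmul(q, g))
    rhs = matmul(q, update(g))
    return max(abs(l - r) for lrow, rrow in zip(lhs, rhs)
               for l, r in zip(lrow, rrow))

G = [[3.0, -1.0], [0.5, 2.0]]   # made-up gradient values
Q = rotation(math.pi / 6)       # rotate by 30 degrees

gap_sign = equivariance_gap(sign_update, G, Q)
gap_norm = equivariance_gap(normalized_update, G, Q)
print("sign update gap:      ", gap_sign)
print("normalized update gap:", gap_norm)
```

In this sketch the sign update leaves a clearly nonzero gap, because taking signs entry by entry depends on the choice of coordinates, while the normalized update's gap is at the level of floating-point rounding. Which groups actually act on embeddings, LM heads, MLPs, and MoE routers, and which updates the authors derive for them, is specified in the paper and not reproduced here.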

Impact: Introduces novel optimizer designs that improve training stability and final validation loss for language models.

Ranking rationale: The cluster contains an academic paper detailing a new theoretical principle and experimental validation for optimizer design in deep learning.

Read on arXiv stat.ML →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →


Sources [2]

  1. arXiv stat.ML · TIER_1 · English (EN) · Tim Tsz-Kit Lau, Weijie Su

    Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    arXiv:2605.18106v1 Announce Type: cross Abstract: A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its var…

  2. arXiv stat.ML · TIER_1 · English (EN) · Weijie Su

    Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, renderin…