PulseAugur

New Dead-Direction Conditioners optimize deep networks by respecting symmetries

Researchers have proposed Dead-Direction Conditioners (DDC), an optimization technique that trains deep neural networks while respecting the continuous symmetries of their parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, and the per-head attention rotation. Adam's per-coordinate preconditioner drifts along these symmetry orbits. DDC instead conditions the optimizer's state within each orbit so that the training trajectory stays on the relevant quotient space. The authors report that this prevents over-training collapse in language models and reaches lower validation loss in vision transformers than standard optimizers, with further gains reported for deep networks trained with Muon.
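
The two simplest symmetries named above can be checked numerically. The sketch below is an illustration only, not code from the paper: the tiny bias-free ReLU network, its weights, and every function name are hypothetical. It confirms that the softmax cross-entropy loss is unchanged by a constant logit shift and by a positive rescaling of one hidden ReLU unit, and that the loss gradient has no component along the logit-shift direction. Reading that zero-gradient direction as a "dead direction" is an assumption about the paper's terminology.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def cross_entropy_grad(logits, target):
    """Gradient w.r.t. the logits: softmax(logits) - one_hot(target)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

def relu_mlp(x, w1, w2):
    """Two-layer bias-free ReLU network: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# Hypothetical toy data: 3 inputs, 2 hidden units, 3 classes.
x = [0.5, -1.0, 2.0]
w1 = [[0.2, -0.4, 0.1], [-0.3, 0.8, 0.5]]
w2 = [[1.0, -2.0], [0.5, 0.3], [-0.7, 0.9]]
target = 1

logits = relu_mlp(x, w1, w2)
base_loss = cross_entropy(logits, target)

# Logit shift: adding the same constant to every logit leaves the loss unchanged.
shifted_loss = cross_entropy([z + 3.7 for z in logits], target)

# ReLU rescaling: multiply hidden unit 0's incoming weights by a > 0
# and divide its outgoing weights by a; the network output is unchanged.
a = 4.0
w1_scaled = [[a * w for w in w1[0]], w1[1]]
w2_scaled = [[row[0] / a, row[1]] for row in w2]
rescaled_loss = cross_entropy(relu_mlp(x, w1_scaled, w2_scaled), target)

# The gradient is orthogonal to the all-ones (logit-shift) direction,
# so its components sum to zero up to rounding.
shift_component = sum(cross_entropy_grad(logits, target))

print(base_loss, shifted_loss, rescaled_loss, shift_component)
```

Because the loss cannot distinguish points on the same orbit, any movement along these directions changes the parameters without changing the function the network computes.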

IMPACT This method could lead to more stable and efficient training of large language and vision models, potentially improving performance and reducing computational costs.

RANK_REASON Academic paper detailing a novel method for optimizing deep neural networks.

Read on arXiv stat.ML →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv stat.ML TIER_1 English(EN) · Tejas Pradeep Shirodkar ·

    Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

    arXiv:2606.29176v1 · Announce Type: cross · Abstract: A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symme…

  2. arXiv stat.ML TIER_1 English(EN) · Tejas Pradeep Shirodkar ·

    Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

    A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symm…
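
The drift the abstract attributes to Adam can be seen on the logit-shift symmetry with a few lines of arithmetic. The sketch below is a hedged illustration using the standard Adam update rule and a made-up gradient. It is not the paper's DDC method, and the function and variable names are hypothetical. A softmax cross-entropy gradient with respect to the logits sums to zero, so a plain gradient step has no component along the all-ones shift direction. Adam divides each coordinate by its own running magnitude, so its step generally does.

```python
import math

def adam_first_step(grad, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update from zero-initialized moments, with bias correction."""
    m = [(1 - beta1) * g for g in grad]        # first-moment estimate
    v = [(1 - beta2) * g * g for g in grad]    # second-moment estimate
    m_hat = [mi / (1 - beta1) for mi in m]     # bias-corrected moments
    v_hat = [vi / (1 - beta2) for vi in v]
    return [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

# Hypothetical logit gradient; its entries sum to zero, as softmax(z) - one_hot does.
grad = [0.6, -0.9, 0.3]

sgd_step = [-g for g in grad]
adam_step = adam_first_step(grad)

# Component of each step along the all-ones logit-shift direction.
sgd_drift = sum(sgd_step)
adam_drift = sum(adam_step)

print("SGD drift along shift direction: ", sgd_drift)
print("Adam drift along shift direction:", adam_drift)
```

On the first step Adam's update is approximately the negative sign of each gradient coordinate, so with two positive entries and one negative entry the step has a net component along the shift direction even though the gradient has none.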