Brief

last 24h

[9/9] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
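
The feed does not publish its scoring formula, so the following is only a minimal sketch of how four such signals might be combined into a 0–100 score. The weights, the half-life, the `Story` fields, and the choice of exponential decay are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights and half-life: the feed does not publish its formula.
WEIGHTS = {"authority": 0.40, "cluster_strength": 0.35, "headline_signal": 0.25}
HALF_LIFE_HOURS = 72.0


@dataclass
class Story:
    authority: float         # source authority, assumed normalized to 0..1
    cluster_strength: float  # how many sources cover the story, assumed 0..1
    headline_signal: float   # headline relevance, assumed 0..1
    age_hours: float         # hours since first publication


def score(story: Story) -> float:
    """Weighted sum of the three signals, scaled by exponential time decay."""
    base = sum(
        weight * min(max(getattr(story, name), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    decay = 0.5 ** (max(story.age_hours, 0.0) / HALF_LIFE_HOURS)
    return round(100.0 * base * decay, 1)


fresh = Story(authority=0.9, cluster_strength=0.5, headline_signal=0.8, age_hours=0.0)
week_old = Story(authority=0.9, cluster_strength=0.5, headline_signal=0.8, age_hours=168.0)
print(score(fresh), score(week_old))
```

Under these assumptions a story loses half its score every 72 hours, so an identical story that is a week old ranks well below a fresh one.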

RESEARCH · arXiv stat.ML English(EN) · 1w · [2 sources]

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Researchers have developed Ringmaster LMO, an asynchronous method for training neural networks that targets inefficiencies in distributed systems. It builds on delay thresholding to manage gradient staleness, aiming to speed up training in heterogeneous environments. The method is designed for unconstrained stochastic non-convex optimization and, in experiments on quadratic problems and language model pretraining, outperformed existing synchronous and asynchronous baselines.

IMPACT This asynchronous optimization method could accelerate large-scale model training in distributed and heterogeneous computing environments.
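
The delay-thresholding idea mentioned in the summary can be sketched as a server loop that discards gradients computed on a model version that is too old. This is an illustration only: the function name, the `max_delay` parameter, and the plain gradient step are assumptions, and the plain step stands in for the LMO-based momentum update the paper describes.

```python
def server_loop(x, arrivals, lr=0.1, max_delay=2):
    """Apply asynchronously arriving gradients, dropping stale ones.

    arrivals: list of (gradient, computed_at) pairs, where computed_at is the
    server version the worker read before computing the gradient.
    """
    version = 0   # number of updates applied so far
    applied = 0
    dropped = 0
    for grad, computed_at in arrivals:
        delay = version - computed_at
        if delay > max_delay:
            # Hypothetical threshold rule: the gradient is too stale to use.
            dropped += 1
            continue
        # Stand-in update: plain gradient step, not the paper's LMO step.
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
        version += 1
        applied += 1
    return x, applied, dropped


# Four workers all read version 0; the last gradient arrives with delay 3.
arrivals = [([1.0], 0), ([1.0], 0), ([1.0], 0), ([1.0], 0)]
x, applied, dropped = server_loop([1.0], arrivals)
print(x, applied, dropped)
```

With `max_delay=2`, the first three gradients are applied and the fourth is dropped, which is how a threshold bounds the staleness of any update that reaches the model.
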
RESEARCH · arXiv stat.ML English(EN) · 1w · [2 sources]

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Researchers have introduced a principle for designing deep learning optimizers that respects the inherent symmetries of neural network architectures. Unlike current optimizers such as Adam, which update parameters coordinate-wise, the proposed symmetry-compatible optimizers are equivariant to the symmetry group of each weight block. The principle has been applied to embeddings, LM heads, SwiGLU MLPs, and MoE routers, yielding novel update rules. Experiments on language models show that these optimizers consistently improve validation loss and training stability compared to standard AdamW.

IMPACT Introduces novel optimizer designs that improve training stability and final validation loss for language models.
TOOL · arXiv cs.LG English(EN) · 5d

HORST: Composing Optimizer Geometries for Sparse Transformer Training

Researchers have developed HORST, an optimizer designed to improve the training of sparse transformers. Standard optimizers struggle to balance sparsity against training stability. HORST addresses this by composing optimizer steps as non-commutative operators and integrating hyperbolic geometry to achieve both stability and an L1 sparsity bias. Experiments show HORST significantly outperforms AdamW baselines across vision and language tasks, especially at higher sparsity levels.

IMPACT Enables more efficient training of sparse transformer models, potentially leading to smaller and faster AI systems.
- transformers
- AdamW
- HORST
TOOL · arXiv cs.LG English(EN) · 6d

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

Researchers have introduced LionMuon, an optimization algorithm designed for efficient training of large-scale models. The method alternates between the low-cost updates of Lion and the stronger but more expensive spectral updates of Muon. By sharing a single momentum buffer, LionMuon significantly reduces the average iteration cost while maintaining effectiveness. Experiments show LionMuon outperforms Muon, Lion, Signum, and AdamW across various model sizes and datasets, achieving lower validation loss with less compute.

IMPACT Introduces a new optimization technique that could significantly reduce the computational cost of training large AI models.
- Muon
- AdamW
- Lion
- Signum
- LionMuon
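
The alternation described in the LionMuon summary can be sketched as below. This is not the paper's algorithm: the `period` parameter, the single-beta momentum rule, and the cubic Newton-Schulz iteration used for the spectral step are assumptions chosen to keep the example short, and the matrices are plain Python lists.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def sign_step(m):
    """Cheap sign-descent direction: elementwise sign of the momentum."""
    return [[(v > 0) - (v < 0) for v in row] for row in m]


def spectral_step(m, iters=10):
    """Approximate the orthogonal polar factor of m.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X after
    Frobenius normalization (an assumption; Muon implementations typically
    use a tuned higher-order variant).
    """
    fro = sum(v * v for row in m for v in row) ** 0.5 or 1.0
    x = [[v / fro for v in row] for row in m]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def alternating_step(w, grad, buf, step, lr=0.01, beta=0.9, period=4):
    """One update: a spectral step every `period` steps, a sign step otherwise.

    Both kinds of step read the same momentum buffer `buf`.
    """
    buf = [[beta * b + (1 - beta) * g for b, g in zip(rb, rg)] for rb, rg in zip(buf, grad)]
    direction = spectral_step(buf) if step % period == 0 else sign_step(buf)
    w = [[wi - lr * di for wi, di in zip(rw, rd)] for rw, rd in zip(w, direction)]
    return w, buf


w = [[0.0, 0.0], [0.0, 0.0]]
buf = [[0.0, 0.0], [0.0, 0.0]]
grad = [[3.0, 0.0], [0.0, 4.0]]
for step in range(4):
    w, buf = alternating_step(w, grad, buf, step)
print(w)
```

In this sketch only one step in every `period` pays for the matrix multiplications of the spectral update, which is the source of the lower average iteration cost the summary refers to.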
RESEARCH · Hugging Face Daily Papers English(EN) · 5d · [4 sources]

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Researchers have developed new methods for hyperparameter transfer, enabling more efficient scaling of large neural networks. One paper introduces a parameterization justified by dynamical mean-field theory that allows reliable hyperparameter transfer across models ranging from 51 million to over 2 billion parameters. Another study quantifies hyperparameter transfer and highlights the critical role of the embedding layer's learning rate, suggesting that maximizing it can significantly improve training stability and performance, particularly with the AdamW optimizer.

IMPACT New parameterization and optimization techniques could significantly reduce the cost and complexity of training large-scale AI models.
RESEARCH · arXiv stat.ML English(EN) · 4d · [2 sources]

Anytime Training with Schedule-Free Spectral Optimization

Researchers have developed SF-NorMuon, a schedule-free spectral optimizer that matches or surpasses the performance of AdamW. It addresses a key limitation of current anytime training methods, where schedule-free approaches often underperform. Because SF-NorMuon yields high-quality checkpoints at any point without a pre-defined training horizon, it is a more practical tool for open-ended continual learning.

IMPACT Enables more flexible and efficient neural network training by allowing high-quality checkpoints at any stage without fixed schedules.
- SF-AdamW
- AdamW
- SF-NorMuon
- arXiv
RESEARCH · Hugging Face Daily Papers English(EN) · 6d · [2 sources]

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

A new research paper demonstrates that the choice of optimizer significantly affects a Transformer model's capacity and scaling laws, even when the architecture is identical. The study found that the Muon optimizer achieved linear scaling in representation capacity, a 2.3x improvement over AdamW's weaker scaling, particularly in challenging rare-token regimes. This suggests that the optimizer should be treated as a primary factor in model scaling, alongside architecture and data, and highlights the potential of co-designing optimizers and architectures.

IMPACT Highlights that optimizer choice is a critical, under-explored factor in achieving optimal model scaling and representation capacity.
- Muon
- Transformer
- AdamW
RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [12 sources]

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Researchers have developed several new optimization techniques to improve deep learning model training. AMUSE combines the rapid adaptation of Muon with the stability of Schedule-Free averaging, eliminating the need for learning rate schedules and improving performance across vision and language tasks. Another approach, MiMuon, improves Muon's generalization by blending it with SGD, yielding a lower generalization error. Additionally, a new optimizer called Pion addresses Muon's limitations in vision-language-action (VLA) models and reinforcement learning with verifiable rewards (RLVR) by applying a spectral high-pass filtering mechanism.

IMPACT These new optimizers aim to improve training efficiency and generalization for large models, potentially accelerating development in areas like LLMs and robotics.
- YOLO26m
- Muon optimizer
- MiMuon
- Qwen3-0.6B
- AMUSE
- Qwen3
- SGD
- Muon
- AdamW
- Schedule-Free
RESEARCH · arXiv cs.LG English(EN) · 1w · [41 sources]

Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

Several recent research papers explore advanced optimization techniques for machine learning. One paper introduces a derivative-free consensus-based method for nonconvex bi-level optimization, with convergence guarantees for its mean-field and finite-particle approximations. Another study presents Curvature-Tuned Accelerated Gradient Descent (CT-AGD), which captures local curvature and reduces training epochs by an average of 33% on deep learning tasks. Further work investigates stochastic approximation algorithms under heavy-tailed noise, analyzing concentration bounds and the impact of noise on error tails. Other papers cover stochastic gradient variational inference, global convergence of stochastic conic particle gradient descent, and the suboptimality of momentum SGD in nonstationary environments.

IMPACT Advances in optimization algorithms are crucial for improving the efficiency and performance of machine learning models.