Brief

last 24h

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.LG English(EN) · 3d · [2 sources]

Non-normal spectral signatures of instability in neural network training dynamics

Researchers have developed a theoretical framework based on non-Hermitian operator theory to explain and predict training instabilities in deep neural networks. The study finds that common optimizers such as Adam and SGD with momentum have non-normal update operators, which can cause transient amplification and loss spikes. The proposed pseudospectral precursor bound, which uses kappa(V) as an indicator, distinguishes stable from unstable training phases and outperforms traditional spectral-radius measures in the reported experiments.

IMPACT Provides a new theoretical lens for understanding and potentially mitigating common training failures in deep learning models.
- Adam
- SGD
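
Why a spectral radius below 1 can still coexist with a loss spike is easier to see on a toy example. The sketch below is not taken from the paper: the 2x2 matrix `A` is a hypothetical update operator, and reading kappa(V) as the condition number of the eigenvector matrix is an assumption based on the standard bound ||A^k|| <= kappa(V) * rho(A)^k for diagonalizable A.

```python
import numpy as np

# Hypothetical update operator (not from the paper): both eigenvalues lie
# inside the unit circle, but the large off-diagonal entry makes it non-normal.
A = np.array([[0.9, 10.0],
              [0.0, 0.8]])

eigvals, V = np.linalg.eig(A)
spectral_radius = np.max(np.abs(eigvals))  # below 1, so A^k decays eventually
kappa_V = np.linalg.cond(V)                # assumed meaning of kappa(V)

# Track the 2-norm of A^k to expose transient growth before the decay sets in.
norms = []
P = np.eye(2)
for _ in range(200):
    P = P @ A
    norms.append(np.linalg.norm(P, 2))

peak = max(norms)
print(f"spectral radius = {spectral_radius:.2f}")
print(f"kappa(V)        = {kappa_V:.1f}")
print(f"peak ||A^k||    = {peak:.1f}")
print(f"final ||A^k||   = {norms[-1]:.2e}")
```

The spectral radius alone predicts monotone decay here, while the large kappa(V) signals that the norm of A^k can grow substantially before it shrinks.
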
RESEARCH · arXiv stat.ML English(EN) · 6d · [2 sources]

Factor Augmented High-Dimensional SGD

Researchers have introduced Factor-Augmented SGD (FSGD), an optimization method designed for high-dimensional machine learning tasks. FSGD operates on streaming data, so it scales to large problems without storing the full dataset. The work also establishes a theoretical framework for analyzing SGD that accounts for latent factor estimation error and provides moment convergence guarantees.

IMPACT Introduces a scalable optimization method for high-dimensional machine learning tasks, potentially improving performance on large datasets.
RESEARCH · Hugging Face Daily Papers English(EN) · 6d · [10 sources]

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Researchers have developed several new optimization techniques to improve deep learning model training. AMUSE combines the rapid adaptation of Muon with the stability of Schedule-Free averaging, eliminating the need for learning rate schedules and improving performance across vision and language tasks. Another approach, MiMuon, improves Muon's generalization by blending it with SGD, yielding lower generalization error. Additionally, a new optimizer called Pion addresses Muon's limitations in vision-language-action and reinforcement learning by applying a spectral high-pass filtering mechanism.

IMPACT These new optimizers aim to improve training efficiency and generalization for large models, potentially accelerating development in areas like LLMs and robotics.
- MiMuon
- Muon optimizer
- Qwen3-0.6B
- YOLO26m
- AMUSE
- Qwen3
- SGD
- Muon
- AdamW
- Schedule-Free
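
For readers unfamiliar with the spectral view behind these Muon variants, the following minimal sketch shows the orthogonalization step that Muon is generally described as applying to its momentum matrix. It is an illustration only: an exact SVD stands in for the Newton-Schulz iteration used in practice, the matrix `G` and the helper `orthogonalize` are hypothetical, and nothing here reproduces the AMUSE, MiMuon, or Pion methods.

```python
import numpy as np

def orthogonalize(M):
    """Return the polar factor U @ Vt of M, i.e. M with every nonzero
    singular value replaced by 1 (exact SVD used as a stand-in)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))  # hypothetical gradient or momentum matrix
O = orthogonalize(G)

# After orthogonalization the update has a flat singular value spectrum.
sv = np.linalg.svd(O, compute_uv=False)
print(np.round(sv, 6))
```

Because the update's singular values are flattened to 1, any method that reweights or filters the spectrum, as the high-pass description of Pion suggests, is modifying exactly this step.
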
RESEARCH · arXiv cs.LG English(EN) · 1w · [27 sources]

Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

Several recent research papers explore advanced optimization techniques for machine learning. One paper introduces a derivative-free consensus-based method for nonconvex bi-level optimization, with convergence guarantees for its mean-field and finite-particle approximations. Another presents Curvature-Tuned Accelerated Gradient Descent (CT-AGD), which captures local curvature and reduces training epochs by an average of 33% on deep learning tasks. A third investigates stochastic approximation algorithms under heavy-tailed noise, analyzing concentration bounds and the effect of noise on error tails. Other papers cover stochastic gradient variational inference, global convergence of stochastic conic particle gradient descent, and the suboptimality of momentum SGD in nonstationary environments.

IMPACT Advances in optimization algorithms are crucial for improving the efficiency and performance of machine learning models.
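
As background on what "accelerated" means here, the sketch below compares plain gradient descent with classical Nesterov momentum on a small ill-conditioned quadratic. This is the textbook scheme, not CT-AGD: the objective, the constants `L` and `mu`, and the helper `run` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * x^T H x.
H = np.diag([1.0, 100.0])
L, mu = 100.0, 1.0  # largest and smallest eigenvalues of H

def grad(x):
    return H @ x

def run(momentum, tol=1e-6, max_iters=10_000):
    """Count iterations until ||x|| < tol; momentum=0 gives plain gradient descent."""
    x_prev = x = np.array([1.0, 1.0])
    for k in range(1, max_iters + 1):
        y = x + momentum * (x - x_prev)     # look-ahead point
        x_prev, x = x, y - grad(y) / L      # gradient step taken from y
        if np.linalg.norm(x) < tol:
            return k
    return max_iters

# Standard momentum coefficient for strongly convex problems.
beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)

gd_iters = run(0.0)
agd_iters = run(beta)
print(f"gradient descent: {gd_iters} iterations")
print(f"Nesterov:         {agd_iters} iterations")
```

On this toy problem the momentum variant reaches the tolerance in far fewer iterations, which is the kind of gain curvature-aware variants aim to extend to deep learning workloads.
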
