New OptMuon method enhances stochastic optimization with adaptive momentum

arXiv cs.AI TIER_1 English(EN) · Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal · 2026-06-12 04:00

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

arXiv:2606.12808v1 Announce Type: cross Abstract: Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterio…

arXiv cs.LG TIER_1 English(EN) · Meher Bhaskar · 2026-06-11 17:11

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) a…

arXiv cs.LG TIER_1 English(EN) · Handi Zhang, Adrienne M. Propp, Brooks Kinch, Houman Owhadi, Nathaniel Trask · 2026-06-11 04:00

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

arXiv:2606.11650v1 Announce Type: new Abstract: Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary verifica…

arXiv cs.CL TIER_1 English(EN) · Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou · 2026-06-11 04:00

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

arXiv:2606.12370v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to …

arXiv cs.LG TIER_1 English(EN) · Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren · 2026-06-11 04:00

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

arXiv:2606.11431v1 Announce Type: new Abstract: Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness…
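As a point of reference for the Mirror Descent setting named in the abstract, the following is a minimal sketch of entropic mirror descent on the probability simplex (the exponentiated-gradient update). It is a textbook instance, not the construction from the paper; the function name `entropic_mirror_descent`, the linear loss, and the step size are illustrative assumptions.

```python
import math

def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the simplex with the negative-entropy mirror map."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Multiplicative (exponentiated-gradient) update, then renormalize
        # so the iterate stays on the probability simplex.
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x

# Hypothetical linear loss <costs, x>: its minimizer over the simplex
# places all mass on the coordinate with the smallest cost.
costs = [0.9, 0.2, 0.5]
x_final = entropic_mirror_descent(lambda x: costs, [1 / 3, 1 / 3, 1 / 3],
                                  step=0.5, iters=200)
```

With a Euclidean mirror map the same loop would reduce to projected gradient descent; the entropy map replaces the additive step with a multiplicative one.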

arXiv cs.LG TIER_1 English(EN) · Benjamin Leblanc, Louis-Jacob Lebel, Teddy Kana, Richard Kamel · 2026-06-11 04:00

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

arXiv:2606.12054v1 Announce Type: new Abstract: Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design ch…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 02:07

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across h…

arXiv cs.CL TIER_1 English(EN) · Jingren Zhou · 2026-06-10 17:36

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 17:36

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, …

arXiv cs.LG TIER_1 English(EN) · Richard Kamel · 2026-06-10 13:19

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we…

arXiv cs.AI TIER_1 English(EN) · Ruinan Wang, Ian Nabney, Mohammad Golbabaee · 2026-06-10 04:00

Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization

arXiv:2606.10068v1 Announce Type: cross Abstract: Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many low-i…

arXiv cs.LG TIER_1 English(EN) · Mingchen Ma, Guyang Cao, Jelena Diakonikolas, Ilias Diakonikolas · 2026-06-10 04:00

Efficiently Learning Drifting Halfspaces with Massart Noise

arXiv:2606.11149v1 Announce Type: new Abstract: We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may c…

arXiv cs.LG TIER_1 English(EN) · Ryo Sagawa, Daisuke Furihata, Yuto Miyatake · 2026-06-10 04:00

Accelerating SAV-based optimization via randomized low-rank Hessian approximation

arXiv:2606.10562v1 Announce Type: cross Abstract: We propose a new optimization method, the Nyström-enhanced relaxed scalar auxiliary variable method (N-RSAV), which incorporates curvature information into the RSAV framework to accelerate convergence while preserving an uncondi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.

arXiv cs.LG TIER_1 English(EN) · Ilias Diakonikolas · 2026-06-09 17:35

Efficiently Learning Drifting Halfspaces with Massart Noise

We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may change from round to round. The goal is to output…

arXiv cs.LG TIER_1 English(EN) · Jared Lawrence, Ari Kalinsky, Hannah Bradfield, Yair Carmon, Oliver Hinder · 2026-06-09 04:00

The Sample Complexity of Parameter-Free Stochastic Convex Optimization

arXiv:2506.11336v2 Announce Type: replace Abstract: We study the sample complexity of stochastic convex optimization when problem parameters such as the distance to optimality and the Lipschitz constant are unknown. We pursue two strategies. First, we develop a reliable model sel…

arXiv cs.AI TIER_1 English(EN) · Prayas Agrawal, Prateek Chanda, Ishita Khatri, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria · 2026-06-09 04:00

Minibatch Selection via Partition Matroid Constrained Gradient Matching

arXiv:2606.07954v1 Announce Type: cross Abstract: Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rel…

arXiv cs.AI TIER_1 English(EN) · Stéphane Eilles-Chan Way, Hugo Percot, Quentin Cappart, Tias Guns, Louis-Martin Rousseau · 2026-06-09 04:00

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

arXiv:2606.08797v1 Announce Type: cross Abstract: Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational cos…

arXiv cs.AI TIER_1 English(EN) · Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan · 2026-06-09 04:00

SVRG and Beyond via Posterior Correction

arXiv:2512.01930v2 Announce Type: replace-cross Abstract: Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed over a decade ago, these methods have never been connected to any Bayesian method at …
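The "gradient corrections" mentioned in the abstract can be illustrated with a minimal sketch of standard SVRG on a scalar least-squares problem. This shows only the classical algorithm, not the posterior-correction view the paper develops; the function name `svrg_least_squares`, the data, and the hyperparameters are illustrative assumptions.

```python
import random

def svrg_least_squares(a, b, step, epochs, inner_steps, seed=0):
    """SVRG on f(w) = (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2, scalar w."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a[i] * w - b[i])**2.
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient: the stochastic gradient plus a
            # correction built from the snapshot.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * g
    return w

# Hypothetical toy data; the closed-form minimizer is sum(a*b) / sum(a*a).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 7.0]
w_svrg = svrg_least_squares(a, b, step=0.02, epochs=30, inner_steps=50)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

The correction term has zero mean over the sampled index, so the update stays unbiased while its variance shrinks as both the iterate and the snapshot approach the minimizer.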

arXiv cs.LG TIER_1 English(EN) · Liping Tao, Xindi Tong, Chee Wei Tan · 2026-06-09 04:00

Learning to Optimize by Differentiable Programming

arXiv:2601.16510v3 Announce Type: replace-cross Abstract: Solving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorith…

arXiv cs.LG TIER_1 English(EN) · Wentao Zhang, Yutong Zhang, Wentao Mo · 2026-06-09 04:00

Noise-Adaptive High-Probability Regret Bounds for Online Convex Optimization

arXiv:2606.08028v1 Announce Type: new Abstract: We study high-probability regret bounds for online convex optimization (OCO) with strongly convex losses and establish three results that resolve open questions at the intersection of noise adaptivity, feedback structure, and constr…

arXiv cs.LG TIER_1 English(EN) · Binh Nguyen, Trinh Tran, Truong X. Nghiem · 2026-06-09 04:00

LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization

arXiv:2606.08993v1 Announce Type: new Abstract: We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned…

arXiv cs.LG TIER_1 English(EN) · Francesco Bullo · 2026-06-09 04:00

Predictive Coding with Bayesian Priors via Proximal Gradients

arXiv:2606.08374v1 Announce Type: cross Abstract: We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level p…

arXiv cs.LG TIER_1 English(EN) · Ganzhao Yuan · 2026-06-09 04:00

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

arXiv:2606.08783v1 Announce Type: cross Abstract: Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, existing orthogonalized methods are typically paired with constant or open-lo…
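For readers unfamiliar with orthogonalized momentum, the sketch below shows one generic step of that family: accumulate momentum, replace it with an approximately orthogonal matrix, and move the weights along it. It is not OptMuon and not the tuned iteration used in Muon implementations; the classical cubic Newton-Schulz iteration is swapped in for simplicity, and all function names, the learning rate, and the momentum coefficient are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(M, iters=15):
    """Approximate the orthogonal polar factor of a nonzero matrix M
    with the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    norm = sum(v * v for row in M for v in row) ** 0.5
    # Dividing by the Frobenius norm keeps every singular value at most 1,
    # which lies inside the iteration's convergence region.
    X = [[v / norm for v in row] for row in M]
    for _ in range(iters):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, XXtX)]
    return X

def orthogonalized_momentum_step(W, momentum, grad, lr=0.1, beta=0.9):
    """One generic step: update momentum, orthogonalize it, move the weights."""
    momentum = [[beta * m + g for m, g in zip(rm, rg)]
                for rm, rg in zip(momentum, grad)]
    direction = orthogonalize(momentum)
    W = [[w - lr * d for w, d in zip(rw, rd)] for rw, rd in zip(W, direction)]
    return W, momentum
```

Here `lr` and `beta` are fixed constants; the abstract contrasts this kind of constant or open-loop setup with the closed-loop approach the paper proposes.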

arXiv cs.AI TIER_1 English(EN) · Munsik Kim · 2026-06-08 04:00

A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

arXiv:2606.07325v1 Announce Type: cross Abstract: We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\…

arXiv cs.LG TIER_1 English(EN) · Eshed Gal, Samy Wu Fung, Eldad Haber · 2026-06-08 04:00

Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization

arXiv:2603.13546v2 Announce Type: replace Abstract: We introduce Probabilistic Gaussian Homotopy (PGH), a probability-space continuation framework for nonconvex optimization. Unlike classical Gaussian homotopy, which smooths the objective and uniformly averages gradients, PGH def…

arXiv cs.LG TIER_1 English(EN) · Ming Sun, Kun Yuan · 2026-06-08 04:00

Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

arXiv:2606.07496v1 Announce Type: new Abstract: Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communi…

arXiv cs.LG TIER_1 English(EN) · Rohan Shravan · 2026-06-08 04:00

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

arXiv:2606.07404v1 Announce Type: new Abstract: This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense see…

arXiv cs.LG TIER_1 English(EN) · Alma Rahat, Tinkle Chugh, Jonathan Fieldsend, Richard Allmendinger · 2026-06-08 04:00

Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts

arXiv:2606.06984v1 Announce Type: new Abstract: This paper presents a general acceleration mechanism for multi-objective Bayesian optimisation (MOBO) that leverages Gaussian process predictive gradients as auxiliary signals. Rather than replacing existing Pareto-compliant acquisi…

arXiv cs.LG TIER_1 English(EN) · Leonardo Galli, Curtis Fox, Wiebke Bartolomaeus, Mark Schmidt, Holger Rauhut · 2026-06-08 04:00

Flatland: The Adventures of Gradient Descent with Large Step Sizes

arXiv:2606.06722v1 Announce Type: new Abstract: The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to answer the question: what is the largest possible step …

arXiv cs.AI TIER_1 English(EN) · Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang · 2026-06-08 04:00

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

arXiv:2510.24561v3 Announce Type: replace-cross Abstract: LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, …

arXiv cs.LG TIER_1 English(EN) · Francesco Bullo · 2026-06-06 23:41

Predictive Coding with Bayesian Priors via Proximal Gradients

We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level problem, we show that proximal gradient descent is …

arXiv cs.AI TIER_1 English(EN) · Merve Karakas, Christopher J. Williams, Emmanuel O. Balogun, Sadegh Sadeghi Tabas, Christian Brown, Nikhil Rao · 2026-06-06 04:00

Multi-ResNets for Subspace Preconditioning in Constrained Optimization

arXiv:2606.06300v1 Announce Type: new Abstract: We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through inte…

arXiv cs.LG TIER_1 English(EN) · Kun Yuan · 2026-06-05 17:51

Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the co…

arXiv cs.LG TIER_1 English(EN) · Rohan Shravan · 2026-06-05 15:48

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, through a 5B and a 9B mixture of experts, to …

arXiv cs.AI TIER_1 English(EN) · Munsik Kim · 2026-06-05 14:43

A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\|\nabla_t^k v\|\le\varepsilon$ on the $k$-th covariant…

arXiv cs.LG TIER_1 English(EN) · Disi Lin, Martin Berggren, Tommy Löfstedt · 2026-06-05 04:00

Generalized TV--$\ell_p$ Structured Priors for Bayesian $T_1$ Mapping

arXiv:2606.05381v1 Announce Type: new Abstract: We propose an extended family of structured spatial priors that incorporates the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and incorporated into a Bayesian regression framework to enable unc…

arXiv cs.LG TIER_1 English(EN) · Dongruo Zhou · 2026-06-05 04:00

Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization

arXiv:2606.05438v1 Announce Type: new Abstract: We study the deterministic first-order oracle complexity of finding $\epsilon$-stationary points in smooth nonconvex optimization when the objective satisfies higher-order smoothness assumptions. While the classical \(\epsilon^{-2…

arXiv cs.LG TIER_1 English(EN) · Christian Coester, Alexa Tudose, Alexander Turoczy · 2026-06-05 04:00

Learning-Augmented Online Minimization with Dual Predictions

arXiv:2606.05380v1 Announce Type: cross Abstract: We present learning-augmented algorithms for two general classes of online minimization problems: metrical task systems and laminar set cover. Both algorithms achieve improved theoretical guarantees using machine-learned predictio…

arXiv cs.LG TIER_1 English(EN) · Andrea Martin, Ian R. Manchester, Luca Furieri · 2026-06-05 04:00

Learning to optimize with guarantees: a complete characterization of linearly convergent algorithms

arXiv:2508.00775v2 Announce Type: replace-cross Abstract: The design of many classical optimization algorithms is driven by the certification of linear convergence rates over classes of optimization problems. In this paper, we consider the problem of improving the average-case pe…

arXiv cs.LG TIER_1 English(EN) · Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin · 2026-06-05 04:00

Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

arXiv:2410.02628v5 Announce Type: replace Abstract: Learning conditional distributions $\pi^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim \pi^*$. However, acquiring paired data samples is of…

arXiv cs.AI TIER_1 English(EN) · Nikhil Rao · 2026-06-04 15:37

Multi-ResNets for Subspace Preconditioning in Constrained Optimization

We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. T…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 10:10

Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. W…

arXiv cs.LG TIER_1 English(EN) · Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed · 2026-06-04 04:00

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

arXiv:2602.05657v2 Announce Type: replace Abstract: The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees,…

arXiv cs.LG TIER_1 English(EN) · Julius Durmann, Amelie Kleber · 2026-06-04 04:00

Mean-based algorithms: A lower bound and regret

arXiv:2606.04931v1 Announce Type: new Abstract: Mean-based algorithms are a class of online learning algorithms that assign low probability to actions with low average rewards. Recent work indicates these algorithms converge favorably to serially undominated actions, which approx…

arXiv cs.CL TIER_1 English(EN) · Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin · 2026-06-04 04:00

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

arXiv:2606.05165v1 Announce Type: cross Abstract: Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated re…

arXiv cs.LG TIER_1 English(EN) · Zhijing Jin · 2026-06-03 17:59

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large …

arXiv cs.LG TIER_1 English(EN) · Amelie Kleber · 2026-06-03 14:23

Mean-based algorithms: A lower bound and regret

Mean-based algorithms are a class of online learning algorithms that assign low probability to actions with low average rewards. Recent work indicates these algorithms converge favorably to serially undominated actions, which approximate Nash equilibria in economic games. However…

arXiv cs.LG TIER_1 English(EN) · Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen · 2026-06-03 04:00

Decentralized Stochastic Nonconvex Optimization under the $(L_0,L_1)$-Smoothness

arXiv:2509.08726v3 Announce Type: replace-cross Abstract: This paper focuses on the decentralized stochastic optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$ over a connected network of $n$ agents, where each local function has the form of $f_i(\mathbf…

arXiv cs.LG TIER_1 English(EN) · Moses Charikar, Chirag Pabbaraju, Ambuj Tewari · 2026-06-03 04:00

From Non-Convex to Strongly Convex: Curvature-Adaptive FTPL for Online Optimization

arXiv:2606.02948v1 Announce Type: new Abstract: Curvature adaptivity is a classical theme in online optimization: for convex Lipschitz losses, adaptive methods interpolate between the optimal $O(\sqrt{T})$ regret for general convex losses and $O(\log T)$ regret under strong conve…

arXiv cs.AI TIER_1 English(EN) · Han Fang, Paul Weng, Yutong Ban · 2026-06-03 04:00

ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

arXiv:2501.17377v4 Announce Type: replace-cross Abstract: Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

STRIDE framework enables efficient training data attribution for LLMs by modeling functional effects in activation space through sparse recovery and steering operators, achieving superior speed and accuracy compared to traditional gradient-based methods.

arXiv cs.LG TIER_1 English(EN) · Yue Wu, Weiqiang Zheng, Yang Cai, Haipeng Luo · 2026-06-02 04:00

Accelerating Min-Max Optimization via Power-Law Stepsizes

arXiv:2606.01764v1 Announce Type: cross Abstract: We revisit the convergence guarantees of the Extragradient (EG) method for unconstrained biaffine min-max optimization. It is known that EG with a fixed stepsize achieves a $\Theta(T^{-1/2})$ last-iterate convergence rate, which i…
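The Extragradient (EG) update named in the abstract can be sketched on the simplest bilinear toy problem, f(x, y) = x * y, minimized over x and maximized over y. This is only an illustration of the fixed-stepsize update; it shows neither the paper's power-law stepsizes nor its worst-case rates, and the function names and step size are assumptions. Plain simultaneous gradient descent-ascent is included for contrast.

```python
def extragradient_bilinear(x, y, step, iters):
    """Extragradient on f(x, y) = x * y (min over x, max over y)."""
    for _ in range(iters):
        # Look-ahead step using gradients at the current point:
        # df/dx = y and df/dy = x.
        x_half = x - step * y
        y_half = y + step * x
        # Actual update using gradients evaluated at the look-ahead point.
        x, y = x - step * y_half, y + step * x_half
    return x, y

def gda_bilinear(x, y, step, iters):
    """Simultaneous gradient descent-ascent on the same problem."""
    for _ in range(iters):
        x, y = x - step * y, y + step * x
    return x, y

# Hypothetical starting point and step size.
eg_x, eg_y = extragradient_bilinear(1.0, 1.0, step=0.5, iters=100)
gda_x, gda_y = gda_bilinear(1.0, 1.0, step=0.5, iters=100)
```

On this toy problem each EG iteration scales the squared distance to the saddle point (0, 0) by (1 - step**2)**2 + step**2, which is below 1 for step sizes in (0, 1), while each gradient descent-ascent iteration scales it by 1 + step**2 and therefore spirals outward.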

arXiv cs.LG TIER_1 English(EN) · Nicholas Knight · 2026-06-02 04:00

Riemannian Gradient Descent for Low-Rank Architectures

arXiv:2606.02328v1 Announce Type: new Abstract: We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three g…

arXiv cs.LG TIER_1 English(EN) · Matthew Regehr, Gautam Kamath, Andrew Lowy · 2026-06-02 04:00

Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

arXiv:2606.01527v1 Announce Type: new Abstract: Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals' data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unle…

arXiv cs.LG TIER_1 English(EN) · Gishnu Madhu, Feng Liu, Souma Chowdhury · 2026-06-02 04:00

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

arXiv:2606.01425v1 Announce Type: new Abstract: Mixed-combinatorial nonlinear programming (MCNLP) problems arise in many engineering design and planning applications, e.g., due to categorical, component, and geometric design choices, as well as joint task and motion planning. Tra…

arXiv cs.LG TIER_1 English(EN) · Shion Takeno · 2026-06-02 04:00

Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee

arXiv:2606.00956v1 Announce Type: new Abstract: This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, th…

arXiv cs.LG TIER_1 English(EN) · Bing Liu, Wenjie Zhou, Chengcheng Zhao · 2026-06-02 04:00

Rethinking Bregman Divergences in Kronecker-Factored Optimizers

arXiv:2606.00542v1 Announce Type: new Abstract: Shampoo-style optimizers approximate gradient covariance matrices using Kronecker-factored structures. Recent work~\cite{lin2026understanding} showed that such approximations can be viewed as projections under Bregman matrix diverge…

arXiv cs.AI TIER_1 English(EN) · Shuhei Watanabe, Frank Hutter · 2026-06-02 04:00

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

arXiv:2211.14411v5 Announce Type: replace-cross Abstract: Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as on memory usage or latency, on top of the performance requi…

arXiv cs.AI TIER_1 English(EN) · Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez, Yunjia Yang, Nils Thuerey · 2026-06-02 04:00

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

arXiv:2606.02179v1 Announce Type: cross Abstract: Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains u…

arXiv cs.AI TIER_1 English(EN) · Munsik Kim · 2026-06-02 04:00

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

arXiv:2606.00703v1 Announce Type: cross Abstract: Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what …

arXiv cs.AI TIER_1 English(EN) · Dongjun Kim, Adrian de Wynter, Huancheng Chen, Heasung Kim, Haris Vikalo · 2026-06-02 04:00

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

arXiv:2606.00132v1 Announce Type: cross Abstract: While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting-aware methods typically seek safer updates through speci…

arXiv cs.AI TIER_1 English(EN) · Yi-Xiang Hu · 2026-06-02 04:00

Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations

arXiv:2606.00002v1 Announce Type: new Abstract: Mixed-Integer Linear Programming (MILP) decision engines routinely output nominally optimal plans for high-stakes industrial systems. Yet deployment rarely matches solve-time assumptions: small perturbations in costs, demands, or re…

arXiv cs.LG TIER_1 English(EN) · Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang · 2026-06-02 04:00

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

arXiv:2606.02221v1 Announce Type: cross Abstract: Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify th…

arXiv cs.LG TIER_1 English(EN) · Minduli Wijayatunga, Roberto Armellin · 2026-06-02 04:00

Tiny Recursive Models for Solving the J2-Perturbed Lambert Problem

arXiv:2606.00895v1 Announce Type: cross Abstract: This paper presents a fast, recursive neural solver for the J2-perturbed Lambert problem based on Tiny Recursive Models (TRM), termed the TRM-Perturbed Lambert (TRM-PL) model. TRM is a weight-shared architecture whose effective ca…

arXiv cs.LG TIER_1 English(EN) · Dingzhi Yu, Wei Jiang, Hongyi Tao, Yuanyu Wan, Lijun Zhang · 2026-06-02 04:00

Mirror Descent Under Generalized Smoothness

arXiv:2502.00753v4 Announce Type: replace-cross Abstract: Smoothness is crucial for attaining fast rates in first-order optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by al…

arXiv cs.LG TIER_1 English(EN) · Edwige Cyffers, Alireza Mirrokni, Marco Mondelli · 2026-06-02 04:00

Optimal Regularization for Performative Learning

arXiv:2510.12249v2 Announce Type: replace Abstract: In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. On…

arXiv cs.LG TIER_1 English(EN) · Jiayu Zhang, Tianyi Lin · 2026-06-02 04:00

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

arXiv:2605.18528v2 Announce Type: replace-cross Abstract: A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only supp…

arXiv cs.LG TIER_1 English(EN) · Nicholas Knight · 2026-06-01 14:40

Riemannian Gradient Descent for Low-Rank Architectures

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three geometries for rank-$r$ partial isometries, and b…

arXiv cs.LG TIER_1 English(EN) · Jingge Wang · 2026-06-01 13:20

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approache…

arXiv cs.AI TIER_1 English(EN) · Nils Thuerey · 2026-06-01 12:36

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is gov…

arXiv cs.AI TIER_1 English(EN) · Zeou Hu, Kelvin Ho, Yaoliang Yu · 2026-06-01 04:00

A Unified Framework for Gradient Aggregation in Multi-Objective Optimization

arXiv:2605.30452v1 Announce Type: cross Abstract: Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed ca…

arXiv cs.AI TIER_1 English(EN) · Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma · 2026-06-01 04:00

SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling

arXiv:2510.05115v3 Announce Type: replace Abstract: Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain …

arXiv cs.LG TIER_1 English(EN) · Sharan Vaswani, Yifan Sun, Reza Babanezhad · 2026-06-01 04:00

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

arXiv:2605.30648v1 Announce Type: new Abstract: Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is …

arXiv cs.LG TIER_1 English(EN) · Abhishek Chakraborty, Angelia Nedić · 2026-06-01 04:00

Randomized Feasibility Methods for Constrained Optimization with Adaptive Step Sizes

arXiv:2601.20076v2 Announce Type: replace-cross Abstract: We consider minimizing an objective function subject to constraints defined by the intersection of lower-level sets of convex functions. We study two cases: (i) strongly convex and Lipschitz-smooth objective function and (…

arXiv cs.LG TIER_1 English(EN) · Shengyu Feng, Tarun Suresh, Yiming Yang · 2026-06-01 04:00

Unsupervised Diffusion Solver for Combinatorial Optimization via Combinatorial Adjoint Matching

arXiv:2605.30920v1 Announce Type: new Abstract: Diffusion-based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near-optimal solutions. In this work, we extend adjoi…

arXiv cs.LG TIER_1 English(EN) · Junbin Qiu, Zhaowei Hong, Renzhe Xu, Yao Shu · 2026-06-01 04:00

Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens

arXiv:2605.30960v1 Announce Type: new Abstract: Accurate Zeroth-Order (ZO) Hessian estimation is a cornerstone of derivative-free methods, essential for tasks such as bilevel optimization, Bayesian inference, and uncertainty quantification. However, obtaining a complete suite of …

arXiv cs.LG TIER_1 English(EN) · Zihao Chen · 2026-06-01 04:00

A Unifying View of Anchoring via Operator-Side Tikhonov Regularization

arXiv:2605.30905v1 Announce Type: cross Abstract: Anchored fixed point and monotone equation methods, including Halpern iteration, extra anchored gradient, and their relatives, add a vanishing pull toward a reference point to obtain last-iterate guarantees. Existing anchored vari…

arXiv cs.LG TIER_1 English(EN) · Ferhat Erata, Orr Paradise, Thanos Typaldos, Timos Antonopoulos, ThanhVu Nguyen, Shafi Goldwasser, Ruzica Piskac · 2026-06-01 04:00

Learning Randomized Reductions

arXiv:2412.18134v4 Announce Type: replace Abstract: Randomized self-reductions (RSRs) express $f(x)$ using $f$ evaluated at random correlated points, enabling self-correcting programs, instance-hiding protocols, and applications in complexity theory and cryptography. Yet discover…

arXiv cs.LG TIER_1 English(EN) · Qian Xie, Linda Cai, Alexander Terenin, Peter I. Frazier, Ziv Scully · 2026-06-01 04:00

Cost-aware Stopping for Bayesian Optimization

arXiv:2507.12453v5 Announce Type: replace Abstract: In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions in a cost-aware manner is an important but underexplored practi…

arXiv cs.LG TIER_1 English(EN) · Dai Hai Nguyen, Duc Dung Nguyen, Atsuyoshi Nakamura, Hiroshi Mamitsuka · 2026-06-01 04:00

Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization

arXiv:2601.19220v2 Announce Type: replace Abstract: We study multi-objective optimization over probability distributions in Wasserstein space. Recently, Nguyen et al. (2025) introduced the Multiple Wasserstein Gradient Descent (MWGraD) algorithm, which exploits the geometric structur…

arXiv cs.LG TIER_1 English(EN) · Yaohong Yang, Sammie Katt, Samuel Kaski · 2026-06-01 04:00

Multi-Objective Bayesian Optimization via Adaptive $\varepsilon$-Constraints Decomposition

arXiv:2604.15959v2 Announce Type: replace Abstract: Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing multiple expensive black-box functions. However, existing MOBO methods often struggle with coverage, scalability, and handling constrain…

arXiv cs.LG TIER_1 English(EN) · Hua Li · 2026-05-29 04:00

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

arXiv:2605.29494v1 Announce Type: new Abstract: Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, includin…

arXiv cs.LG TIER_1 English(EN) · Shutong Ding, Yimiao Zhou, Ke Hu, Xi Yao, Junchi Yan, Xiaoying Tang, Ye Shi · 2026-05-29 04:00

Diffusion-based learning framework for Constrained Nonconvex Optimization with Weighted Bootstrapped Refinement

arXiv:2502.10330v4 Announce Type: replace Abstract: Recent advances in diffusion models show promising potential to accelerate nonconvex problem solving by leveraging their multimodality. However, most existing diffusion-based optimization approaches rely on supervised learning a…

arXiv cs.LG TIER_1 English(EN) · Jisung Hwang, Minhyuk Sung · 2026-05-29 04:00

Gradient Preconditioning for Efficient and Reliable Reward-Guided Generation

arXiv:2602.08646v2 Announce Type: replace Abstract: We propose a gradient preconditioning method that makes reward-guided generation with one-step generative models both efficient and reliable. Test-time noise optimization can unlock substantially better reward-guided generations…

arXiv cs.LG TIER_1 English(EN) · Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier · 2026-05-29 04:00

Accelerating trajectory optimization with Sobolev-trained diffusion policies

arXiv:2604.19011v2 Announce Type: replace Abstract: Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, converge…

arXiv cs.AI TIER_1 English(EN) · Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang · 2026-05-29 04:00

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

arXiv:2605.29547v1 Announce Type: cross Abstract: Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. I…

arXiv cs.LG TIER_1 English(EN) · Luxuan Li, Chunfeng Cui, Xiao Wang · 2026-05-29 04:00

MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization

arXiv:2605.29635v1 Announce Type: cross Abstract: In this paper, we study a structured class of nonconvex constrained stochastic problems with difference-of-convex (DC) regularization, where the feasible set is possibly nonconvex and the concave part of the DC regularizer is allo…

arXiv cs.LG TIER_1 English(EN) · Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich · 2026-05-28 04:00

Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

arXiv:2602.06880v2 Announce Type: replace Abstract: Adaptive methods like Adam have become the de facto standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral o…

arXiv cs.LG TIER_1 English(EN) · Ivan Bioli, Carlo Marcati, Giancarlo Sangalli · 2026-05-28 04:00

Accelerating Natural Gradient Descent for PINNs with Randomized Numerical Linear Algebra

arXiv:2505.11638v4 Announce Type: replace-cross Abstract: Natural Gradient Descent (NGD) has emerged as a promising optimization algorithm for training neural network-based solvers for partial differential equations (PDEs), such as Physics-Informed Neural Networks (PINNs). Howeve…

arXiv cs.LG TIER_1 English(EN) · Sara Gjorgjieva, Eva Tuba, Tome Eftimov · 2026-05-28 04:00

Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization

arXiv:2605.28309v1 Announce Type: new Abstract: In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring u…

arXiv cs.LG TIER_1 English(EN) · Jonas Hanselle, Valentin Margraf, Clemens Damke, Eyke Hüllermeier · 2026-05-28 04:00

Unification and Optimization of Robust Supervised Learning

arXiv:2605.28165v1 Announce Type: new Abstract: The literature has proposed various robust alternatives to empirical risk minimisation to address failure modes such as distribution shift, label noise and finite-sample degeneracies. Examples include distributionally robust optimiz…

arXiv cs.LG TIER_1 English(EN) · Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich · 2026-05-28 04:00

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

arXiv:2605.27733v1 Announce Type: new Abstract: Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data, and multiple layer composition, the noise impact is heavy-tailed and survives …

arXiv cs.LG TIER_1 English(EN) · Mohammed Adnan, Rohan Jain, Tom Jacobs, Ekansh Sharma, Rahul G. Krishnan, Rebekka Burkholz, Yani Ioannou · 2026-05-28 04:00

SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training

arXiv:2605.27541v1 Announce Type: new Abstract: Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense tr…

arXiv cs.AI TIER_1 English(EN) · Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi, Pascal Van Hentenryck · 2026-05-28 04:00

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

arXiv:2605.18692v2 Announce Type: replace Abstract: Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules and unforeseen p…

arXiv cs.AI TIER_1 English(EN) · Yunwen Lei, Zimeng Wang, Xiaoming Yuan · 2026-05-28 04:00

Stochastic Gradient Descent with Momentum is Algorithmically Stable

arXiv:2605.28517v1 Announce Type: cross Abstract: Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insuffi…

arXiv cs.AI TIER_1 English(EN) · Teodor-Mihai Stupariu, Andrei Manolache · 2026-05-28 04:00

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

arXiv:2605.27662v1 Announce Type: cross Abstract: Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural mod…

arXiv cs.AI TIER_1 English(EN) · Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell · 2026-05-28 04:00

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

arXiv:2605.27619v1 Announce Type: cross Abstract: Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. Whi…

arXiv cs.AI TIER_1 English(EN) · Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel J. Kochenderfer · 2026-05-28 04:00

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

arXiv:2605.27996v1 Announce Type: new Abstract: Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias sub…

arXiv cs.AI TIER_1 English(EN) · Xiaoming Yuan · 2026-05-27 14:17

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can gener…

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Tome Eftimov · 2026-05-27 11:08

Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization

In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring unnecessary computational cost. We study a learni…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 04:37

Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

Backpropagation is the default learning rule for artificial neural networks and is often treated as the settled approach whenever differentiability is available. In this work, we revisit this convention through a theoretical lens of sample efficiency. We introduce a unified vecto…

arXiv cs.LG TIER_1 English(EN) · Yixuan Yang, Yuqing He, Song Li · 2026-05-27 04:00

Convergence of Spectral Descent for Non-smooth Optimization

arXiv:2605.26977v1 Announce Type: new Abstract: The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heav…

arXiv cs.LG TIER_1 English(EN) · Dmitry Kovalev · 2026-05-27 04:00

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

arXiv:2605.15522v2 Announce Type: replace-cross Abstract: Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of genera…

arXiv cs.LG TIER_1 English(EN) · Fabian Schaipp, Robert M. Gower, Adrien Taylor · 2026-05-27 04:00

Step-Size Stability in Stochastic Optimization: A Theoretical Perspective

arXiv:2602.09842v2 Announce Type: replace-cross Abstract: We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as …

arXiv cs.LG TIER_1 English(EN) · Kartik Gupta, Stephen D. Miller, Pradeep Ravikumar, Ramarathnam Venkatesan · 2026-05-27 04:00

Stochastic global optimization of continuous functions via random walks on Grassmannians

arXiv:2605.14151v1 Announce Type: cross Abstract: We introduce a stochastic global optimization method based on random walks on Grassmannian manifolds. To minimize a continuous objective $\ell:\mathbb{R}^d\rightarrow\mathbb{R}$, the method repeatedly samples random $k$-dimensiona…

arXiv cs.LG TIER_1 English(EN) · Kukyoung Jang, Taehyun Cho, Junrui Zhang, Ping Xu, Kyungjae Lee · 2026-05-27 04:00

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

arXiv:2605.27316v1 Announce Type: new Abstract: Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a gen…

arXiv cs.LG TIER_1 English(EN) · Kyungjae Lee · 2026-05-26 17:25

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-26 17:25

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible …

arXiv cs.LG TIER_1 English(EN) · Song Li · 2026-05-26 13:02

Convergence of Spectral Descent for Non-smooth Optimization

The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-s…

arXiv cs.LG TIER_1 English(EN) · Ziyue Chen, David Šiška, Lukasz Szpruch · 2026-05-26 04:00

Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

arXiv:2605.24939v1 Announce Type: new Abstract: We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function appro…

arXiv cs.LG TIER_1 English(EN) · Matan Schliserman, Shira Vansover-Hager, Tomer Koren · 2026-05-26 04:00

Flat Minima and Generalization: Insights from Stochastic Convex Optimization

arXiv:2511.03548v2 Announce Type: replace Abstract: Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, …

arXiv cs.LG TIER_1 English(EN) · Enea Monzio Compagnoni, Rustem Islamov, Frank Norbert Proske, Aurelien Lucchi, Antonio Orvieto, Eduard Gorbunov · 2026-05-26 04:00

On the Interaction of Batch Noise, Adaptivity, and Compression, under $(L_0,L_1)$-Smoothness: An SDE Approach

arXiv:2506.00181v2 Announce Type: replace Abstract: Distributed stochastic optimization intertwines (i) stochastic gradient noise, (ii) communication compression, and (iii) adaptive/normalized updates. While each factor has been studied in isolation, their joint effect under real…

arXiv cs.LG TIER_1 English(EN) · Jose Blanchet, Peter Glynn, Wenhao Yang · 2026-05-26 04:00

Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

arXiv:2605.26000v1 Announce Type: cross Abstract: Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients hav…

arXiv cs.LG TIER_1 English(EN) · Khen Cohen, Mark Glass, Meir Feder, Yaron Oz · 2026-05-26 04:00

Implicit Binarization via Complex Phase Dynamics in Combinatorial Optimization

arXiv:2605.24502v1 Announce Type: cross Abstract: We introduce a physics-inspired continuous relaxation framework that yields substantially improved solutions for NP-hard combinatorial optimization problems, including Quadratic Unconstrained Binary Optimization (QUBO), binary spa…

arXiv cs.LG TIER_1 English(EN) · Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong · 2026-05-26 04:00

EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

arXiv:2605.25395v1 Announce Type: new Abstract: Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. I…

arXiv cs.LG TIER_1 English(EN) · Yudong W. Xu, Wenhao Li, Xiaoyu Wang, Scott Sanner, Elias B. Khalil · 2026-05-26 04:00

Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization

arXiv:2605.25129v1 Announce Type: new Abstract: Diffusion models have shown promise in learning to solve constraint optimization problems. However, they are mostly restricted to problems with binary variables and rely on graph neural networks, hindering their application to a bro…

arXiv cs.LG TIER_1 English(EN) · Zhuanghua Liu, Luo Luo · 2026-05-26 04:00

Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

arXiv:2605.24513v1 Announce Type: new Abstract: This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed …

arXiv cs.AI TIER_1 English(EN) · Chen Liang, Xiatao Sun, Qian Wang, Daniel Rakita · 2026-05-26 04:00

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

arXiv:2605.14373v2 Announce Type: replace-cross Abstract: Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they…

arXiv cs.AI TIER_1 English(EN) · Haoyu Huang, Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang · 2026-05-26 04:00

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

arXiv:2605.10989v3 Announce Type: replace-cross Abstract: The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., sign function). However, prevailing methods including the Straight-Throug…

arXiv cs.AI TIER_1 English(EN) · Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee · 2026-05-26 04:00

EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization

arXiv:2508.12479v2 Announce Type: replace-cross Abstract: Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc. For these problems, gradient-based methods are well understood and enjoy strong guarantees. However, in the absence of con…

arXiv cs.AI TIER_1 English(EN) · Huangyu Xu, Jingqin Yang, Qianqian Xu, Jiaye Teng · 2026-05-26 04:00

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

arXiv:2605.25134v1 Announce Type: cross Abstract: Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ regularization. However, it may encounter optimization instability due to the unbounded gradie…

arXiv cs.LG TIER_1 English(EN) · Yequan Zhao, Ruijie Zhang, Liyan Tan, Niall Moran, Tong Qin, Zheng Zhang · 2026-05-25 04:00

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

arXiv:2605.22869v1 Announce Type: new Abstract: Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limite…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 03:39

EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short…

arXiv cs.LG TIER_1 English(EN) · Alexander Tyurin · 2026-05-22 04:00

Near-Optimal Convergence of Accelerated Gradient Methods under Generalized and $(L_0, L_1)$-Smoothness

arXiv:2508.06884v2 Announce Type: replace-cross Abstract: We study first-order methods for convex optimization problems with functions $f$ satisfying the recently proposed $\ell$-smoothness condition $\|\nabla^{2}f(x)\| \le \ell\left(\|\nabla f(x)\|\right)$, which generalizes the…

arXiv cs.LG TIER_1 English(EN) · Zhuo Chen (equal contribution), Xinzhe Yuan (equal contribution), Jianshu Zhang (Shanghai Artificial Intelligence Laboratory, Shanghai, China, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China), Jinzong Dong (Shanghai Artificial … · 2026-05-22 04:00

LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation

arXiv:2605.22054v1 Announce Type: new Abstract: The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directl…

arXiv cs.LG TIER_1 English(EN) · Ryan Cory-Wright, Jean Pauphilet · 2026-05-22 04:00

Compact Lifted Relaxations for Low-Rank Optimization

arXiv:2603.20228v2 Announce Type: replace-cross Abstract: We develop tractable convex relaxations for rank-constrained quadratic optimization problems over $n \times m$ matrices, a setting for which tractable relaxations are typically only available when the objective or constrai…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 22:11

Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usu…

arXiv cs.LG TIER_1 English(EN) · Jalal Etesami · 2026-05-19 11:00

Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization

In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its c…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 11:00

Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization

In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its c…

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Shinichi Shirakawa · 2026-05-18 06:31

Adaptive Stochastic Natural Gradient Method for Safe Optimization on Binary Space

Optimization problems in real-world applications across the medical and engineering domains often involve potential risks when evaluating candidate solutions. Safe optimization aims to perform optimization while suppressing unsafe solution evaluations in such situations. For cont…

arXiv cs.LG TIER_1 English(EN) · Frank Liu · 2026-05-15 14:50

Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

In this paper, we present CT-AGD (Curvature-Tuned Accelerated Gradient Descent), an optimization method for non-convex optimization problems in deep learning training tasks. CT-AGD is a general boosting procedure that accelerates first-order methods by explicitly capturing the lo…

arXiv stat.ML TIER_1 English(EN) · Kateřina Henclová, Václav Šmídl · 2026-06-12 04:00

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

arXiv:2602.08913v2 Announce Type: replace-cross Abstract: High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge becaus…

arXiv stat.ML TIER_1 English(EN) · James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz · 2026-06-12 04:00

Efficient Stochastic Optimisation via Sequential Monte Carlo

arXiv:2601.22003v2 Announce Type: replace Abstract: The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic …

arXiv stat.ML TIER_1 English(EN) · Dimitra Maoutsa · 2026-06-12 04:00

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

arXiv:2512.23566v2 Announce Type: replace-cross Abstract: How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geom…

arXiv stat.ML TIER_1 English(EN) · Noah Golowich, Ankur Moitra, Dhruv Rohatgi · 2026-06-11 04:00

The Power of Test-Time Training for Approximate Sampling

arXiv:2606.11437v1 Announce Type: cross Abstract: Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been…

arXiv stat.ML TIER_1 English(EN) · Susmit Sarkar, Abhinav Raghuvanshi, Kushal Chakrabarti, Mayank Baranwal · 2026-06-11 04:00

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

arXiv:2606.11339v1 Announce Type: cross Abstract: We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global …

arXiv stat.ML TIER_1 English(EN) · Junzhuo Gao, Ling Peng, Xu Guo, Heng Lian · 2026-06-11 04:00

Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach

arXiv:2606.11738v1 Announce Type: new Abstract: We study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only hi…

arXiv stat.ML TIER_1 English(EN) · Heng Lian · 2026-06-10 07:15

Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach

We study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only historical summaries, which modifies and improves …

arXiv stat.ML TIER_1 English(EN) · Yiwei Zhou, Ziheng Chen · 2026-06-10 04:00

Deterministic Denominator Design for Localized Tamed Stochastic-Gradient Langevin Dynamics

arXiv:2606.10559v1 Announce Type: cross Abstract: Tamed stochastic-gradient Langevin dynamics (SGLD) stabilizes large drifts by adding a denominator to the update. If this denominator uses the same stochastic-gradient sample as the update step, it can also change the conditional …

arXiv stat.ML TIER_1 English(EN) · Morris Trestman, Stefan Gugler, Felix A. Faber, O. A. von Lilienfeld · 2026-06-10 04:00

Gradient-Guided Furthest Point Sampling for Robust Training Set Selection

arXiv:2510.08906v2 Announce Type: replace Abstract: Training set sampling methods are used to improve model performance and lower data costs in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Fur…

arXiv stat.ML TIER_1 English(EN) · Sobihan Surendran (LPSM), Adeline Fermanian (LPSM), Sylvain Le Corff (LPSM) · 2026-06-10 04:00

Latent Guided Sampling for Combinatorial Optimization

arXiv:2506.03672v2 Announce Type: replace Abstract: Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NP-hard nature makes them computationally challenging. Recent Neural Combinatorial Optimization (NCO) …

arXiv stat.ML TIER_1 English(EN) · Sasan Vakili, Daniël Woonings, Pradyumna Paruchuri, Peyman Mohajerin Esfahani · 2026-06-10 04:00

Nonlinear Estimator: Dual Bayesian Affine Estimators for Parameter Learning

arXiv:2606.10111v1 Announce Type: cross Abstract: This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and…

arXiv stat.ML TIER_1 English(EN) · Gil Goldshlager, Jiang Hu, Lin Lin · 2026-06-10 04:00

A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms

arXiv:2508.21022v3 Announce Type: replace-cross Abstract: Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample se…

arXiv stat.ML TIER_1 English(EN) · Yahong Yang, Juncai He · 2026-06-10 04:00

Deeper or Wider: A Perspective from Optimal Generalization Error with Sobolev Loss

arXiv:2402.00152v5 Announce Type: replace-cross Abstract: Constructing the architecture of a neural network is a challenging pursuit for the machine learning community, and the dilemma of whether to go deeper or wider remains a persistent question. This paper explores a compariso…

arXiv stat.ML TIER_1 English(EN) · Marc Becker, Lennart Schneider, Martin Binder, Lars Kotthoff, Bernd Bischl · 2026-06-10 04:00

mlr3mbo: Bayesian Optimization in R

arXiv:2603.29730v2 Announce Type: replace Abstract: We present mlr3mbo, a modular toolbox for Bayesian optimization in R. mlr3mbo supports single- and multi-objective optimization, multi-point proposals, batch and asynchronous parallelization, and robust error handling. While it …

arXiv stat.ML TIER_1 English(EN) · Dhruv Rohatgi · 2026-06-09 20:48

The Power of Test-Time Training for Approximate Sampling

Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems.…

arXiv stat.ML TIER_1 English(EN) · Mayank Baranwal · 2026-06-09 18:18

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under restricted secant inequality (RSI)…

arXiv stat.ML TIER_1 English(EN) · Ziheng Chen · 2026-06-09 08:25

Deterministic Denominator Design for Localized Tamed Stochastic-Gradient Langevin Dynamics

Tamed stochastic-gradient Langevin dynamics (SGLD) stabilizes large drifts by adding a denominator to the update. If this denominator uses the same stochastic-gradient sample as the update step, it can also change the conditional mean drift. We study deterministic denominators: t…

arXiv stat.ML TIER_1 English(EN) · Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu · 2026-06-09 04:00

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

arXiv:2606.08850v1 Announce Type: cross Abstract: Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by fau…

arXiv stat.ML TIER_1 English(EN) · Filip Kovačević, Hong Chang Ji, Denny Wu, Mahdi Soltanolkotabi, Marco Mondelli · 2026-06-09 04:00

Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

arXiv:2602.02431v2 Announce Type: replace Abstract: It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. While this phenomenon has been extensively studied in linear regression, the benefit of multi-pass gradi…

arXiv stat.ML TIER_1 English(EN) · Tuan A. Vu, Harri Lähdesmäki, Julien Martinelli · 2026-06-09 04:00

In-Context Learning for Latent Space Bayesian Optimization

arXiv:2606.09664v1 Announce Type: cross Abstract: Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such a…

arXiv stat.ML TIER_1 English(EN) · Federico Bassetti, Vassili De Palma, Lucia Ladelli · 2026-06-09 04:00

Large deviation principles for convolutional Bayesian neural networks

arXiv:2603.06023v2 Announce Type: replace-cross Abstract: While suitably scaled CNNs with Gaussian initialization are known to converge to Gaussian processes as the number of channels diverges, little is known beyond this Gaussian limit. We establish a large deviation principle (…

arXiv stat.ML TIER_1 English(EN) · Trevor Campbell, Jonathan H. Huggins, Kyurae Kim, Charles C. Margossian · 2026-06-09 04:00

Large-scale empirical tuning and comparison of default optimizers for variational inference

arXiv:2606.07841v1 Announce Type: cross Abstract: Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuni…

arXiv stat.ML TIER_1 English(EN) · Wei-Cheng Lee, Francesco Orabona · 2026-06-09 04:00

A Robust $\widetilde{\mathcal{O}}(1/\sqrt{T})$ Rate for Unprojected TD Learning with Linear Function Approximation

arXiv:2506.01052v3 Announce Type: replace-cross Abstract: We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone of reinforcement learning. We are interested in the so-called "robust" setting,…

arXiv stat.ML TIER_1 English(EN) · Jaehoan Kim, Anirban Bhattacharya, Debdeep Pati · 2026-06-09 04:00

Adaptive Resolution for Finite-Rank Gaussian Processes

arXiv:2505.24066v2 Announce Type: replace-cross Abstract: Finite-rank approximations are widely used to scale Gaussian process (GP) regression, but their posterior behavior can differ from that of the corresponding parent GP prior. We study a class of finite-rank GP priors built …

arXiv stat.ML TIER_1 English(EN) · Peyman Mohajerin Esfahani · 2026-06-08 19:41

Nonlinear Estimator: Dual Bayesian Affine Estimators for Parameter Learning

This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retain…

arXiv stat.ML TIER_1 English(EN) · Julien Martinelli · 2026-06-08 15:45

In-Context Learning for Latent Space Bayesian Optimization

Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art r…

arXiv stat.ML TIER_1 English(EN) · Qianqian Lei, Soham Bonnerjee, Yuefeng Han, Wei Biao Wu · 2026-06-08 04:00

Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments

arXiv:2606.06855v1 Announce Type: new Abstract: While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which ca…

arXiv stat.ML TIER_1 English(EN) · Kai Xu · 2026-06-07 21:43

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional …

arXiv stat.ML TIER_1 English(EN) · Charles C. Margossian · 2026-06-05 21:04

Large-scale empirical tuning and comparison of default optimizers for variational inference

Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black…

arXiv stat.ML TIER_1 English(EN) · Daniel Haimovich, Fridolin Linder, Lorenzo Perini, Niek Tax, Milan Vojnovic · 2026-06-05 04:00

On the Convergence of Multicalibration Gradient Boosting

arXiv:2602.06773v2 Announce Type: replace-cross Abstract: Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its conver…

arXiv stat.ML TIER_1 English(EN) · Ziad Kobeissi (L2S), Éloïse Berthier (U2IS) · 2026-06-05 04:00

Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples

arXiv:2606.05967v1 Announce Type: new Abstract: In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning s…

arXiv stat.ML TIER_1 English(EN) · David Janz, Shuai Liu, Alex Ayoub, Csaba Szepesvári · 2026-06-05 04:00

Exploration via linearly perturbed loss minimisation

arXiv:2311.07565v3 Announce Type: replace-cross Abstract: We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative…

arXiv stat.ML TIER_1 English(EN) · Yiwei Zhou, Ziheng Chen · 2026-06-05 04:00

Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming

arXiv:2606.05242v1 Announce Type: new Abstract: Stochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts. This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the ta…

arXiv stat.ML TIER_1 English(EN) · Ziqian Wang, Chenxi Fang, Zhen Zhang · 2026-06-05 04:00

DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables

arXiv:2606.05247v1 Announce Type: cross Abstract: Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the con…

arXiv stat.ML TIER_1 English(EN) · Wei Biao Wu · 2026-06-05 02:59

Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments

While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with…

arXiv stat.ML TIER_1 English(EN) · Éloïse Berthier · 2026-06-04 10:10

Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. W…

arXiv stat.ML TIER_1 English(EN) · Paul D\"utting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam · 2026-06-04 04:00

A General Framework for Dynamic Consistent Submodular Maximization

arXiv:2606.04946v1 Announce Type: cross Abstract: Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored …

arXiv stat.ML TIER_1 English(EN) · Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo · 2026-06-04 04:00

Bayesian learning for the stochastic shortest path problem

arXiv:2606.04845v1 Announce Type: new Abstract: Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We deve…

arXiv stat.ML TIER_1 English(EN) · Morteza Zadimoghaddam · 2026-06-03 14:35

A General Framework for Dynamic Consistent Submodular Maximization

Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where t…

arXiv stat.ML TIER_1 English(EN) · Jiaqi Guo · 2026-06-03 13:13

Bayesian learning for the stochastic shortest path problem

Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal de…

arXiv stat.ML TIER_1 English(EN) · Zhen Zhang · 2026-06-03 11:58

DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables

Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational…

arXiv stat.ML TIER_1 English(EN) · Ziheng Chen · 2026-06-03 07:23

Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming

Stochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts. This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the taming step changes the stochastic oracle itself a…

arXiv stat.ML TIER_1 English(EN) · Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou · 2026-06-03 04:00

Online Learning with Gradient-Variation Interval Regret

arXiv:2606.03831v1 Announce Type: cross Abstract: This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves …

arXiv stat.ML TIER_1 English(EN) · Hyunseok Seung, Matthias Katzfuss · 2026-06-03 04:00

Scalable Derivative Gaussian Processes via Exact Gradient Reduction

arXiv:2606.02909v1 Announce Type: new Abstract: Gradient observations can substantially improve Gaussian process (GP) surrogates, particularly in high-dimensional settings where function evaluations are expensive. However, exact inference with $n$ function values and $n$ full gra…

arXiv stat.ML TIER_1 English(EN) · Zhi-Hua Zhou · 2026-06-02 16:16

Online Learning with Gradient-Variation Interval Regret

This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves an interval regret bound scaling with gradient var…

arXiv stat.ML TIER_1 English(EN) · Dmitrii M. Ostrovskii · 2026-06-02 04:00

Near-Optimal and Tractable Estimation under Shift-Invariance

arXiv:2411.03383v3 Announce Type: replace-cross Abstract: How hard is it to estimate a discrete-time signal $(x_{1}, ..., x_{n}) \in \mathbb{C}^n$ satisfying an unknown linear recurrence relation of order $s$ and observed in i.i.d. complex Gaussian noise? The class of all such si…

arXiv stat.ML TIER_1 English(EN) · Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger, Sebastian Trimpe · 2026-06-02 04:00

Local Preferential Bayesian Optimization

arXiv:2606.02351v1 Announce Type: cross Abstract: Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning fro…

arXiv stat.ML TIER_1 English(EN) · Zijian Liu · 2026-06-02 04:00

In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise

arXiv:2606.00520v1 Announce Type: cross Abstract: Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some r…

arXiv stat.ML TIER_1 English(EN) · Dimitris Oikonomou, Nicolas Loizou · 2026-06-02 04:00

Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

arXiv:2512.02342v3 Announce Type: replace-cross Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optim…

arXiv stat.ML TIER_1 English(EN) · Yuanzhe Tao, Yifeng Liu, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu · 2026-06-02 04:00

Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

arXiv:2412.19444v2 Announce Type: replace-cross Abstract: Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad-hoc tuning of learning rates …

arXiv stat.ML TIER_1 English(EN) · Tongyu Li, Alexander Giessing · 2026-06-02 04:00

Statistical Inference on Gradient Flows

arXiv:2606.01257v1 Announce Type: cross Abstract: Gradient-based algorithms are central to modern statistical estimation, yet their statistical analysis is often restricted to fixed-time behavior, such as convergence to a population target or fluctuations at a prescribed iteratio…

arXiv stat.ML TIER_1 English(EN) · Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar · 2026-06-02 04:00

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

arXiv:2606.01521v1 Announce Type: cross Abstract: A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes…

arXiv stat.ML TIER_1 English(EN) · Yuexiao Dong, Kenichiro Mcalinn, Edoardo Airoldi, Lei Li · 2026-06-02 04:00

FlowSDR: Sufficient Dimension Reduction via Conditional Normalizing Flows

arXiv:2606.01346v1 Announce Type: cross Abstract: Sufficient dimension reduction (SDR) seeks a low-dimensional linear projection of predictors that preserves the conditional distribution of the response. Existing methods target this conditional distribution indirectly, via invers…

arXiv stat.ML TIER_1 English(EN) · Thibault Pautrel, François Portier · 2026-06-02 04:00

Riemannian Stochastic Optimization for Sufficient Dimension Reduction

arXiv:2606.00413v1 Announce Type: new Abstract: Sufficient dimension reduction (SDR) makes high-dimensional regression tractable by projecting the covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based estimators ei…

arXiv stat.ML TIER_1 English(EN) · Matthias Katzfuss · 2026-06-01 21:29

Scalable Derivative Gaussian Processes via Exact Gradient Reduction

Gradient observations can substantially improve Gaussian process (GP) surrogates, particularly in high-dimensional settings where function evaluations are expensive. However, exact inference with $n$ function values and $n$ full gradients in $d$ dimensions scales cubically in the…

arXiv stat.ML TIER_1 English(EN) · Sebastian Trimpe · 2026-06-01 15:00

Local Preferential Bayesian Optimization

Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning from pairwise human feedback, yet existing methods st…

arXiv stat.ML TIER_1 English(EN) · Dario Draca, Takuo Matsubara, Minh-Ngoc Tran · 2026-06-01 04:00

Inversion-Free Natural Gradient Descent on Riemannian Manifolds

arXiv:2604.02969v2 Announce Type: replace Abstract: The natural gradient method is a central tool for statistical optimisation, but its broader application is hindered by the assumption of a Euclidean parameter space, the repeated estimation of the Fisher information matrix (FIM)…

arXiv stat.ML TIER_1 English(EN) · Michael Ibrahim, Hanqi Zhao, Eli Sennesh, Zhi Li, Anqi Wu, Jacob L. Yates, Chengrui Li, Hadi Vafaii · 2026-06-01 04:00

A hitchhiker's guide to Poisson gradient estimation

arXiv:2602.03896v2 Announce Type: replace Abstract: Poisson-distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: *Exponential Arrival Time* (EAT)…

arXiv stat.ML TIER_1 English(EN) · Facheng Yu, Ronak Mehta, Alex Luedtke, Zaid Harchaoui · 2026-06-01 04:00

Stochastic Gradients under Nuisances

arXiv:2508.20326v2 Announce Type: replace Abstract: Stochastic gradient optimization is the dominant learning paradigm for a variety of scenarios, from classical supervised learning to modern self-supervised learning. We consider stochastic gradient algorithms for learning proble…

arXiv stat.ML TIER_1 English(EN) · Mihaela Van der Schaar · 2026-06-01 00:54

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and opt…

arXiv stat.ML TIER_1 English(EN) · Lei Li · 2026-05-31 16:54

FlowSDR: Sufficient Dimension Reduction via Conditional Normalizing Flows

Sufficient dimension reduction (SDR) seeks a low-dimensional linear projection of predictors that preserves the conditional distribution of the response. Existing methods target this conditional distribution indirectly, via inverse moments, local forward regression, or neural ens…

arXiv stat.ML TIER_1 English(EN) · Alexander Giessing · 2026-05-31 14:22

Statistical Inference on Gradient Flows

Gradient-based algorithms are central to modern statistical estimation, yet their statistical analysis is often restricted to fixed-time behavior, such as convergence to a population target or fluctuations at a prescribed iteration. In many applications, however, uncertainty quan…

arXiv stat.ML TIER_1 English(EN) · Zijian Liu · 2026-05-30 04:27

In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise

Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient …

arXiv stat.ML TIER_1 English(EN) · François Portier · 2026-05-29 23:06

Riemannian Stochastic Optimization for Sufficient Dimension Reduction

Sufficient dimension reduction (SDR) makes high-dimensional regression tractable by projecting the covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based estimators either operate in the ambient space and suffer fro…

arXiv stat.ML TIER_1 English(EN) · Rocco Caprio, Adrien Corenflos, Sam Power · 2026-05-29 04:00

Wasserstein Contraction of Coordinate Ascent Variational Inference

arXiv:2605.30253v1 Announce Type: new Abstract: We study the contraction in Wasserstein distance of the coordinate ascent variational inference algorithm. This is shown to hold under a transport-information inequality at the fixed points and a functional smoothness condition. The…

arXiv stat.ML TIER_1 English(EN) · Rustem Islamov, Michael Crawshaw, Jeremy Cohen, Robert Gower · 2026-05-29 04:00

Non-Euclidean Gradient Descent Operates at the Edge of Stability

arXiv:2603.05002v2 Announce Type: replace-cross Abstract: The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian approaches and then hovers near the stability threshold $2/\eta$ during gradient descent (GD) with step size $\eta$. Despi…

arXiv stat.ML TIER_1 English(EN) · Sam Power · 2026-05-28 17:16

Wasserstein Contraction of Coordinate Ascent Variational Inference

We study the contraction in Wasserstein distance of the coordinate ascent variational inference algorithm. This is shown to hold under a transport-information inequality at the fixed points and a functional smoothness condition. The results are general and sharp, allow for local …

arXiv stat.ML TIER_1 English(EN) · Jack Timmermans, Sergio A. Alvarez · 2026-05-28 04:00

Optimal ridge regularization revisited

arXiv:2605.28679v1 Announce Type: cross Abstract: We consider $L^2$-regularized linear (ridge) regression over a finite data sample $X$ with bounded covariance and linear prediction targets $y$ with additive isotropic noise of finite variance. We present an iterative procedure to…

arXiv stat.ML TIER_1 English(EN) · Yibo Jacky Zhang, Zeyu Tang, Sanmi Koyejo · 2026-05-28 04:00

Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

arXiv:2605.27946v1 Announce Type: new Abstract: Backpropagation is the default learning rule for artificial neural networks and is often treated as the settled approach whenever differentiability is available. In this work, we revisit this convention through a theoretical lens of…

arXiv stat.ML TIER_1 English(EN) · Sergei Tikhonov, Arsen Vasilyan · 2026-05-28 04:00

Proper Agnostic Learning of Functions of Halfspaces under Gaussian Marginals

arXiv:2605.27594v1 Announce Type: cross Abstract: We study the problem of computationally efficient proper agnostic learning of multidimensional concept classes under the Gaussian distribution. In this setting, given i.i.d. labeled samples from an unknown distribution over $\math…

arXiv stat.ML TIER_1 English(EN) · Qin Lu, Konstantinos D. Polyzos, Bingcong Li, Georgios B. Giannakis · 2026-05-28 04:00

Surrogate modeling for Bayesian optimization beyond a single Gaussian process

arXiv:2205.14090v2 Announce Type: replace Abstract: Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics.…

arXiv stat.ML TIER_1 English(EN) · Kamélia Daudel, François Roueff · 2026-05-28 04:00

Learning with Importance Weighted Variational Inference

arXiv:2410.12035v2 Announce Type: replace Abstract: Several variational bounds involving importance weighting ideas generalize the Evidence Lower BOund (ELBO) for marginal likelihood optimization, such as the Importance-weighted Auto-Encoder (IWAE), Variational Rényi (VR) and V…

arXiv stat.ML TIER_1 English(EN) · Tam Le (LPSM) · 2026-05-28 04:00

Unregularized limit of stochastic gradient method for Wasserstein distributionally robust optimization

arXiv:2506.04948v2 Announce Type: replace-cross Abstract: Wasserstein distributionally robust optimization offers a framework for model fitting in machine learning under potential shifts in the data distribution. We study a regularized variant of this problem in which entropic sm…

arXiv stat.ML TIER_1 English(EN) · Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim · 2026-05-28 04:00

Flatness-Aware Stochastic Gradient Langevin Dynamics

arXiv:2510.02174v3 Announce Type: replace-cross Abstract: Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic…

arXiv stat.ML TIER_1 English(EN) · Sergio A. Alvarez · 2026-05-27 16:12

Optimal ridge regularization revisited

We consider $L^2$-regularized linear (ridge) regression over a finite data sample $X$ with bounded covariance and linear prediction targets $y$ with additive isotropic noise of finite variance. We present an iterative procedure to compute the optimal regularization strength numer…

arXiv stat.ML TIER_1 English(EN) · Sanmi Koyejo · 2026-05-27 04:37

Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

Backpropagation is the default learning rule for artificial neural networks and is often treated as the settled approach whenever differentiability is available. In this work, we revisit this convention through a theoretical lens of sample efficiency. We introduce a unified vecto…

arXiv stat.ML TIER_1 English(EN) · Zhaosong Lu, Xiangyuan Wang · 2026-05-27 04:00

A first-order method for constrained nonconvex-nonconcave minimax optimization

arXiv:2510.01168v3 Announce Type: replace-cross Abstract: We study a class of constrained nonconvex-nonconcave minimax optimization problems in which the inner maximization involves potentially complex constraints. Under the assumption that the inner problem of a novel lifted min…

arXiv stat.ML TIER_1 English(EN) · Zusen Xu, Jia-Jie Zhu · 2026-05-27 04:00

Gradient Flow Sampler-based Distributionally Robust Optimization

arXiv:2510.25956v3 Announce Type: replace-cross Abstract: We propose a mathematically principled PDE gradient flow framework for distributionally robust optimization (DRO). Exploiting the recent advances in the intersection of Markov Chain Monte Carlo sampling and gradient flow t…

arXiv stat.ML TIER_1 English(EN) · Mikalai Korbit, Mario Zanon · 2026-05-27 04:00

Incremental Gauss-Newton Descent for Machine Learning

arXiv:2408.05560v2 Announce Type: replace-cross Abstract: Stochastic gradient updates are widely used for their efficiency and scalability, but their effective step sizes can depend strongly on feature scaling and local model sensitivity. Gauss-Newton methods address such scale e…

arXiv stat.ML TIER_1 English(EN) · Arsen Vasilyan · 2026-05-26 19:07

Proper Agnostic Learning of Functions of Halfspaces under Gaussian Marginals

We study the problem of computationally efficient proper agnostic learning of multidimensional concept classes under the Gaussian distribution. In this setting, given i.i.d. labeled samples from an unknown distribution over $\mathbb{R}^d \times \{\pm 1\}$ whose marginal on $\math…

arXiv stat.ML TIER_1 English(EN) · Navil Nandhan, Abbas Khademi, Antonio Silveti-Falls · 2026-05-26 04:00

Boosted Stochastic Frank-Wolfe for Constrained Nonconvex Optimization

arXiv:2605.25255v1 Announce Type: cross Abstract: The boosted Frank-Wolfe algorithm accelerates the classical Frank-Wolfe algorithm by better aligning the update direction with the negative gradient. Its analysis, however, has been limited to deterministic convex problems, with s…

arXiv stat.ML TIER_1 English(EN) · Wenhao Yang · 2026-05-25 16:18

Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting dist…

arXiv stat.ML TIER_1 English(EN) · Antonio Silveti-Falls · 2026-05-24 21:04

Boosted Stochastic Frank-Wolfe for Constrained Nonconvex Optimization

The boosted Frank-Wolfe algorithm accelerates the classical Frank-Wolfe algorithm by better aligning the update direction with the negative gradient. Its analysis, however, has been limited to deterministic convex problems, with step sizes that require either line search or knowl…

arXiv stat.ML TIER_1 English(EN) · Krishnakumar Balasubramanian · 2026-05-22 04:00

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

arXiv:2605.22795v1 Announce Type: new Abstract: We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the differen…

arXiv cs.CV TIER_1 English(EN) · Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu · 2026-05-22 04:00

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

arXiv:2605.21907v1 Announce Type: new Abstract: The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from i…

arXiv stat.ML TIER_1 English(EN) · Krishnakumar Balasubramanian · 2026-05-21 17:49

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the difference of the kernel-smoothed data score and the ker…

arXiv stat.ML TIER_1 English(EN) · Tansheng Zhu, Hongyu Zhou, Ke Jin, Xusheng Xu, Qiufan Yuan, Lijie Ji · 2026-05-21 04:00

Bayesian Optimization by Kernel Regression and Density-based Exploration

arXiv:2502.06178v5 Announce Type: replace-cross Abstract: Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the cubic per-iteration cost of Gaussian processes, which results…

arXiv stat.ML TIER_1 Italiano(IT) · Fares El Khoury, Houssam Zenati, Nathan Kallus, Michael Arbel, Aurélien Bibaut · 2026-05-21 04:00

Semiparametric Efficient Bilevel Gradient Estimation

arXiv:2605.21341v1 Announce Type: new Abstract: Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we de…

arXiv stat.ML TIER_1 English(EN) · Shubhada Agrawal, Siva Theja Maguluri, Martin Zubeldia · 2026-05-21 04:00

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

arXiv:2605.20999v1 Announce Type: cross Abstract: We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. Wh…

arXiv stat.ML TIER_1 Italiano(IT) · Aurélien Bibaut · 2026-05-20 16:07

Semiparametric Efficient Bilevel Gradient Estimation

Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we develop a semiparametric debiasing theory for popu…

arXiv stat.ML TIER_1 English(EN) · Martin Zubeldia · 2026-05-20 10:38

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. When the Martingale-difference noise is bounded, we …

arXiv stat.ML TIER_1 English(EN) · Kyurae Kim, Qiang Fu, Yi-An Ma, Jacob R. Gardner, Trevor Campbell · 2026-05-20 04:00

Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space

arXiv:2602.18718v2 Announce Type: replace Abstract: For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) p…

arXiv stat.ML TIER_1 English(EN) · Sharan Sahu, Cameron J. Hogan, Martin T. Wells · 2026-05-20 04:00

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

arXiv:2601.12238v4 Announce Type: replace Abstract: In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothnes…

arXiv stat.ML TIER_1 English(EN) · Yohann De Castro (ICJ, ECL, IUF, PSPM), Sébastien Gadat (TSE-R, IUF), Clément Marteau (ICJ, UCBL, PSPM) · 2026-05-20 04:00

Fast Spawn&Prune (FS&P): Global convergence of stochastic conic particle gradient descent via birth/death process

arXiv:2605.19784v1 Announce Type: cross Abstract: We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods a…

arXiv stat.ML TIER_1 English(EN) · Clément Marteau · 2026-05-19 12:50

Fast Spawn&Prune (FS&P): Global convergence of stochastic conic particle gradient descent via birth/death process

We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods are computationally efficient, they may become trap…

arXiv stat.ML TIER_1 English(EN) · Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos · 2026-05-19 04:00

What is the long-run distribution of stochastic gradient descent? A large deviations analysis

arXiv:2406.09241v3 Announce Type: replace-cross Abstract: In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be…

arXiv stat.ML TIER_1 English(EN) · Zijian Liu · 2026-05-19 04:00

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

arXiv:2512.23178v3 Announce Type: replace-cross Abstract: Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient n…

arXiv stat.ML TIER_1 English(EN) · Tobias Brock, Thomas Nagler · 2026-05-19 04:00

Fast Rates for Nonstationary Weighted Risk Minimization

arXiv:2602.05742v2 Announce Type: replace Abstract: Weighted empirical risk minimization is a common approach to prediction under distribution drift. This article studies its out-of-sample prediction error under nonstationarity. We provide a general decomposition of the excess ri…

arXiv stat.ML TIER_1 English(EN) · Ye He, Krishnakumar Balasubramanian, Sayan Banerjee, Promit Ghosal · 2026-05-19 04:00

Finite-Particle Rates for Regularized Stein Variational Gradient Descent

arXiv:2602.05172v2 Announce Type: replace Abstract: We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024) that corrects the constant-order bias of the SVGD by applying a resolvent-type precondition…

arXiv stat.ML TIER_1 English(EN) · Zijian Liu · 2026-05-19 04:00

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

arXiv:2605.18694v1 Announce Type: cross Abstract: Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient no…

arXiv stat.ML TIER_1 English(EN) · Zijian Liu · 2026-05-18 17:30

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the co…

r/MachineLearning TIER_1 English(EN) · /u/Otaku_7nfy · 2026-06-03 11:57

TorchDAE: Implicit DAE Solvers with Index Reduction and Adjoint Sensitivity [P]

Hello everyone, I've been working on a PyTorch library for solving Differential Algebraic Equations (DAEs) that supports vectorized execution and GPU acceleration. The library implements several algorithms that are not currently ava…

COVERAGE [235]