New research advances bandit algorithms for control, causality, and multi-objective learning
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 36 sources
Multiple research papers explore advancements in bandit algorithms across various domains. One study introduces a machine learning framework for optimal control of fluid restless multi-armed bandit problems, achieving significant speed-ups in applications like machine maintenance and epidemic control. Another paper challenges the optimality of graph learning in causal bandits, proposing new algorithms that bypass graph recovery for improved regret minimization. Further research investigates the complexity of multi-objective bandits, showing Pareto regret scales similarly to single-objective problems, and explores bandit learning in open multi-agent systems with dynamic agent populations. Additional work addresses constrained contextual bandits with adversarial contexts, misspecified kernelized bandit optimization, and a unified framework for distributional regret in bandits and reinforcement learning.
AI IMPACT
These papers advance theoretical understanding and algorithmic approaches in multi-armed bandits and related reinforcement learning problems, potentially leading to more efficient and robust AI systems in various applications.
RANK_REASON
Cluster consists of multiple arXiv papers on theoretical machine learning topics.
We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signatu…
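The snippet stops before the estimator details, but the recipe it describes (compute path signatures, then run a linear bandit on the signature coordinates) can be sketched. Below is a minimal Python illustration under that reading; `depth2_signature` and `linucb_choose` are hypothetical names, and a depth-2 truncation stands in for whatever truncation level the paper actually uses.

```python
import numpy as np

def depth2_signature(path):
    """Depth-2 signature features of a piecewise-linear path.

    path: (T, d) array of observations. Returns level-0 (constant 1),
    level-1 (total increments), and level-2 iterated integrals, computed
    segment by segment for a piecewise-linear path.
    """
    inc = np.diff(path, axis=0)            # (T-1, d) segment increments
    level1 = inc.sum(axis=0)               # total displacement per dimension
    d = inc.shape[1]
    level2 = np.zeros((d, d))
    run = np.zeros(d)                      # level-1 signature accumulated so far
    for dx in inc:
        # cross term between the path so far and this segment,
        # plus the within-segment contribution 0.5 * dx (x) dx
        level2 += np.outer(run, dx) + 0.5 * np.outer(dx, dx)
        run += dx
    return np.concatenate([[1.0], level1, level2.ravel()])

def linucb_choose(sig_feats, A_inv, b, alpha=1.0):
    """Pick the arm whose signature features maximize the LinUCB index."""
    theta = A_inv @ b
    scores = [f @ theta + alpha * np.sqrt(f @ A_inv @ f) for f in sig_feats]
    return int(np.argmax(scores))
```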
This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm ach…
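The abstract credits the exponential-weight algorithm, so a compact sketch of exponential weights with importance-weighted reward estimates (Exp3) over a finite candidate set may help; in the kernelized setting that finite set would be a cover of the domain, and the step sizes here are illustrative, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(pull, n_actions, horizon, eta=0.05, gamma=0.05):
    """Exp3: exponential weights with importance-weighted reward estimates.

    pull(a) returns the observed reward in [0, 1] for action index a.
    """
    log_w = np.zeros(n_actions)
    for _ in range(horizon):
        p = np.exp(log_w - log_w.max())
        p = (1 - gamma) * p / p.sum() + gamma / n_actions  # forced exploration
        a = rng.choice(n_actions, p=p)
        r = pull(a)
        log_w[a] += eta * r / p[a]   # unbiased importance-weighted update
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```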
This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in…
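One concrete ingredient any such method needs is the greedy step: given a current estimate of the reward vector, find the best action with at most k nonzero entries. Over the unit Euclidean ball this has a closed form (keep the k largest-magnitude coordinates and rescale), sketched below; the paper's estimation and exploration machinery is not reproduced here.

```python
import numpy as np

def best_k_sparse_action(theta_hat, k):
    """The k-sparse, unit-norm action maximizing <theta_hat, a>.

    Keeps the k largest-magnitude coordinates of theta_hat, rescaled to
    unit Euclidean norm; all other coordinates are zero.
    """
    idx = np.argsort(np.abs(theta_hat))[-k:]
    a = np.zeros_like(theta_hat)
    a[idx] = theta_hat[idx]
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a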
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this p…
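Whatever posterior model is plugged in, the Thompson sampling round itself is simple; the sketch below shows one round with a pluggable `posterior_sample` callable (in the paper's setting that draw would come from a PFN forward pass, which is not reproduced here).

```python
import numpy as np

def thompson_step(contexts, posterior_sample, pull):
    """One round of Thompson sampling for contextual bandits.

    posterior_sample() draws a reward function f: context -> predicted
    reward from the current posterior; pull(a) returns the realized
    reward for the chosen arm index a.
    """
    f = posterior_sample()               # one draw from the posterior
    scores = [f(x) for x in contexts]    # predicted reward per arm
    a = int(np.argmax(scores))           # act greedily under the sample
    return a, pull(a)
```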
arXiv cs.LG
TIER_1·Dimitris Bertsimas, Cheol Woo Kim, José Niño-Mora·
arXiv:2502.03725v2 Announce Type: replace Abstract: We present a novel machine learning framework for the optimal control of fluid restless multi-armed bandit problems (FRMABPs) with state equations that are either affine or quadratic in the state variables. By establishing funda…
arXiv:2605.05745v1 Announce Type: new Abstract: We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) fee…
arXiv:2604.07096v2 Announce Type: replace Abstract: Multi-objective bandits have attracted increasing attention for their broad applicability, with \(d\)-dimensional reward vectors inducing Pareto regret. There has been a subtle debate over whether this added structure makes the …
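For concreteness, a standard way to act under d-dimensional rewards is Pareto-UCB-style selection: form a UCB vector per arm and play uniformly among arms whose UCB vector is not dominated. The sketch below follows that classic recipe, not this paper's specific algorithm, which the snippet does not spell out.

```python
import numpy as np

def pareto_ucb_arm(means, counts, t, rng):
    """Play uniformly among arms whose d-dimensional UCB vector is not
    Pareto-dominated by any other arm's UCB vector.

    means: (K, d) empirical reward vectors; counts: (K,) pull counts.
    """
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    ucb = means + bonus[:, None]
    K = len(ucb)
    front = [i for i in range(K)
             if not any(np.all(ucb[j] >= ucb[i]) and np.any(ucb[j] > ucb[i])
                        for j in range(K))]
    return int(rng.choice(front))
```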
arXiv:2510.16811v3 Announce Type: replace Abstract: We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward's parents and then applying classic ban…
arXiv:2605.05967v1 Announce Type: new Abstract: Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, wh…
arXiv:2605.06190v1 Announce Type: new Abstract: We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, reward…
arXiv:2605.06202v1 Announce Type: new Abstract: Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imp…
arXiv:2605.05102v1 Announce Type: new Abstract: We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over…
arXiv cs.LG
TIER_1·Stefana-Lucia Anita, Gabriel Turinici·
arXiv:2605.03752v1 Announce Type: new Abstract: Multi-Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal p…
arXiv:2605.03493v1 Announce Type: new Abstract: The goal of this thesis is to investigate the structural properties of certain sequential problems in order to bring the solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be rep…
Multi-Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and has served as the foundation for downstrea…
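The softmax (Boltzmann) mapping the abstract refers to converts value estimates into a sampling distribution over arms. A minimal illustration, with an arbitrary temperature and a synthetic reward standing in for the environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_values, temperature=0.1):
    """Boltzmann/softmax arm-selection probabilities from value estimates."""
    z = np.asarray(q_values) / temperature
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

# one round: sample an arm, observe a reward, update a running mean
q, n = np.zeros(5), np.zeros(5)
a = rng.choice(5, p=softmax_policy(q))
r = rng.normal(0.5, 1.0)                # synthetic stand-in for the true reward
n[a] += 1
q[a] += (r - q[a]) / n[a]
```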
The goal of this thesis is to investigate the structural properties of certain sequential problems in order to bring the solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be represented as graphs on actions. In the second par…
arXiv cs.LG
TIER_1·Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Keegan Harris, Zhiwei Steven Wu·
arXiv:2502.00204v3 Announce Type: replace Abstract: We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after wh…
arXiv:2510.22819v2 Announce Type: replace Abstract: The convergence analysis of online learning algorithms is central to machine learning theory, where the last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolutio…
arXiv cs.LG
TIER_1·Zichun Ye, Runqi Wang, Xuchuang Wang, Xutong Liu, Shuai Li, Mohammad Hajiesmaili·
arXiv:2605.00638v1 Announce Type: new Abstract: Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / superv…
arXiv:2602.14474v2 Announce Type: replace Abstract: We study the $K$-armed Multi-Armed Bandit (MAB) problem with $M$ heterogeneous data sources, each exhibiting unknown and distinct noise variances $\{\sigma_j^2\}_{j=1}^M$. The learner's objective is standard MAB regret minimization, …
arXiv cs.LG
TIER_1·Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yun-En Liu·
arXiv:2605.00488v1 Announce Type: new Abstract: In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and presen…
Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / supervised machine unlearning, leaving unlearning for …
In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that inter…
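The snippet cuts off before the algorithm, but one simple way to realize such an interpolation is to mix least-pulled-arm exploration (good for estimating every mean) with a UCB step (good for reward), controlled by a single mixing parameter. The sketch below is that generic mixture, not the paper's method; `explore_frac` is an illustrative knob.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_arm(means, counts, t, explore_frac=0.2):
    """Interpolate between uniform estimation and reward maximization.

    explore_frac=1 approaches round-robin estimation of all arm means;
    explore_frac=0 recovers plain UCB.
    """
    if rng.random() < explore_frac:
        return int(np.argmin(counts))   # pull the least-explored arm
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(means + bonus))
```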
arXiv:2508.08420v3 Announce Type: replace-cross Abstract: We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is oft…
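A common way to use such offline data is to warm-start the ridge-regression statistics that LinUCB maintains, so the online phase begins from the pooled estimate. A minimal sketch under that reading (the paper's precise algorithm is not shown in the snippet):

```python
import numpy as np

def warm_start(offline_X, offline_y, lam=1.0):
    """Seed ridge-regression statistics from offline observations."""
    d = offline_X.shape[1]
    A = lam * np.eye(d) + offline_X.T @ offline_X   # regularized Gram matrix
    b = offline_X.T @ offline_y                     # response vector
    return A, b

def linucb_arm(arms, A, b, alpha=1.0):
    """Standard LinUCB index over candidate arm feature vectors."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    return int(np.argmax([x @ theta + alpha * np.sqrt(x @ A_inv @ x)
                          for x in arms]))
```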
arXiv:2605.07115v1 Announce Type: cross Abstract: Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-t…
arXiv stat.ML
TIER_1·Ishank Juneja, Carlee Joe-Wong, Osman Yağan·
arXiv:2605.07171v1 Announce Type: cross Abstract: The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications the goal is often to minimize cost subject to a constraint on the mi…
We study the single-index bandit problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particul…
Kullback-Leibler (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of reverse or forward KL. Recent studies have demonstrated $\varepsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the sta…
The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications the goal is often to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by…
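The snippet does not give the algorithm, but a natural optimistic rule for this objective is: deem an arm plausibly feasible if its reward UCB clears the floor, then choose the cheapest such arm by cost LCB. The sketch below implements that generic rule; `reward_floor` and the confidence radius are illustrative, not the paper's.

```python
import numpy as np

def min_cost_feasible_arm(r_mean, c_mean, counts, t, reward_floor):
    """Optimistic rule for 'minimize cost subject to reward >= floor'.

    An arm is plausibly feasible if its reward UCB clears the floor;
    among those, pick the cheapest arm by its cost LCB.
    """
    rad = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    feasible = np.where(r_mean + rad >= reward_floor)[0]
    if len(feasible) == 0:                  # nothing looks feasible yet
        return int(np.argmax(r_mean + rad)) # explore by reward UCB
    return int(feasible[np.argmin((c_mean - rad)[feasible])])
```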
Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1…
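The exact index is cut off, but a standard optimistic treatment of an upper-tail target scores each arm by an empirical upper quantile inflated through a DKW-style confidence band on the CDF. A hedged sketch along those lines (not necessarily the paper's estimator):

```python
import numpy as np

def tail_ucb_arm(samples_per_arm, t, alpha=0.1):
    """Score each arm by an optimistic empirical upper quantile.

    samples_per_arm: list of reward-sample arrays, one per arm. The
    (1 - alpha)-quantile level is shifted up by a DKW-style radius on
    the empirical CDF; unpulled arms get +inf to force exploration.
    """
    scores = []
    for obs in samples_per_arm:
        if len(obs) == 0:
            scores.append(np.inf)
            continue
        eps = np.sqrt(np.log(t + 1) / (2 * len(obs)))  # DKW-style radius
        q = 1.0 - max(alpha - eps, 0.0)                # optimistic level
        scores.append(np.quantile(obs, min(q, 1.0)))
    return int(np.argmax(scores))
```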
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently …
Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dim…
We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby cha…
arXiv:2605.02141v1 Announce Type: cross Abstract: Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Ne…
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regu…
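For reference, the reverse-KL-regularized objective these papers build on has a well-known closed form: penalizing expected reward by KL to a reference policy yields a Gibbs/softmax tilt of that reference. In generic notation (not taken from the paper):

```latex
% Reverse-KL-regularized objective and its closed-form maximizer
% (standard result; beta > 0 is the regularization strength).
\max_{\pi}\; \mathbb{E}_{a \sim \pi}[r(a)]
  - \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\qquad\Longrightarrow\qquad
\pi^{*}(a) \;\propto\; \pi_{\mathrm{ref}}(a)\, \exp\!\big(r(a)/\beta\big).
```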