New research advances bandit algorithms for control, causality, and multi-objective learning
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 36 sources
Multiple research papers explore advancements in bandit algorithms across various domains. One study introduces a machine learning framework for optimal control of fluid restless multi-armed bandit problems, achieving significant speed-ups in applications like machine maintenance and epidemic control. Another paper challenges the optimality of graph learning in causal bandits, proposing new algorithms that bypass graph recovery for improved regret minimization. Further research investigates the complexity of multi-objective bandits, showing Pareto regret scales similarly to single-objective problems, and explores bandit learning in open multi-agent systems with dynamic agent populations. Additional work addresses constrained contextual bandits with adversarial contexts, misspecified kernelized bandit optimization, and a unified framework for distributional regret in bandits and reinforcement learning.
AI IMPACT
These papers advance theoretical understanding and algorithmic approaches in multi-armed bandits and related reinforcement learning problems, potentially leading to more efficient and robust AI systems in various applications.
RANK_REASON
Cluster consists of multiple arXiv papers on theoretical machine learning topics.
We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signatu…
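The snippet stops before the estimator details, but the recipe it describes (compute path signatures, then run a linear bandit on the signature coordinates) can be sketched. Below is a minimal Python illustration under that reading; `depth2_signature` and `linucb_choose` are hypothetical names, and a depth-2 truncation stands in for whatever truncation level the paper actually uses.

```python
import numpy as np

def depth2_signature(path):
    """Depth-2 signature features of a piecewise-linear path.

    path: (T, d) array of observations. Returns level-0 (constant 1),
    level-1 (total increments), and level-2 iterated integrals, computed
    segment by segment for a piecewise-linear path.
    """
    inc = np.diff(path, axis=0)            # (T-1, d) segment increments
    level1 = inc.sum(axis=0)               # total displacement per dimension
    d = inc.shape[1]
    level2 = np.zeros((d, d))
    run = np.zeros(d)                      # level-1 signature accumulated so far
    for dx in inc:
        # cross term between the path so far and this segment,
        # plus the within-segment contribution 0.5 * dx (x) dx
        level2 += np.outer(run, dx) + 0.5 * np.outer(dx, dx)
        run += dx
    return np.concatenate([[1.0], level1, level2.ravel()])

def linucb_choose(sig_feats, A_inv, b, alpha=1.0):
    """Pick the arm whose signature features maximize the LinUCB index."""
    theta = A_inv @ b
    scores = [f @ theta + alpha * np.sqrt(f @ A_inv @ f) for f in sig_feats]
    return int(np.argmax(scores))
```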
This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm ach…
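The abstract credits the exponential-weight algorithm, so a compact sketch of exponential weights with importance-weighted reward estimates (Exp3) over a finite candidate set may help; in the kernelized setting that finite set would be a cover of the domain, and the step sizes here are illustrative, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(pull, n_actions, horizon, eta=0.05, gamma=0.05):
    """Exp3: exponential weights with importance-weighted reward estimates.

    pull(a) returns the observed reward in [0, 1] for action index a.
    """
    log_w = np.zeros(n_actions)
    for _ in range(horizon):
        p = np.exp(log_w - log_w.max())
        p = (1 - gamma) * p / p.sum() + gamma / n_actions  # forced exploration
        a = rng.choice(n_actions, p=p)
        r = pull(a)
        log_w[a] += eta * r / p[a]   # unbiased importance-weighted update
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```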
This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in…
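One concrete ingredient any such method needs is the greedy step: given a current estimate of the reward vector, find the best action with at most k nonzero entries. Over the unit Euclidean ball this has a closed form (keep the k largest-magnitude coordinates and rescale), sketched below; the paper's estimation and exploration machinery is not reproduced here.

```python
import numpy as np

def best_k_sparse_action(theta_hat, k):
    """The k-sparse, unit-norm action maximizing <theta_hat, a>.

    Keeps the k largest-magnitude coordinates of theta_hat, rescaled to
    unit Euclidean norm; all other coordinates are zero.
    """
    idx = np.argsort(np.abs(theta_hat))[-k:]
    a = np.zeros_like(theta_hat)
    a[idx] = theta_hat[idx]
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a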
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this p…
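Whatever posterior model is plugged in, the Thompson sampling round itself is simple; the sketch below shows one round with a pluggable `posterior_sample` callable (in the paper's setting that draw would come from a PFN forward pass, which is not reproduced here).

```python
import numpy as np

def thompson_step(contexts, posterior_sample, pull):
    """One round of Thompson sampling for contextual bandits.

    posterior_sample() draws a reward function f: context -> predicted
    reward from the current posterior; pull(a) returns the realized
    reward for the chosen arm index a.
    """
    f = posterior_sample()               # one draw from the posterior
    scores = [f(x) for x in contexts]    # predicted reward per arm
    a = int(np.argmax(scores))           # act greedily under the sample
    return a, pull(a)
```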
arXiv cs.LG
TIER_1·Dimitris Bertsimas, Cheol Woo Kim, José Niño-Mora·
arXiv:2502.03725v2 Announce Type: replace Abstract: We present a novel machine learning framework for the optimal control of fluid restless multi-armed bandit problems (FRMABPs) with state equations that are either affine or quadratic in the state variables. By establishing funda…
arXiv:2605.05745v1 Announce Type: new Abstract: We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) fee…
arXiv:2604.07096v2 Announce Type: replace Abstract: Multi-objective bandits have attracted increasing attention for their broad applicability, with \(d\)-dimensional reward vectors inducing Pareto regret. There has been a subtle debate over whether this added structure makes the …
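For concreteness, a standard way to act under d-dimensional rewards is Pareto-UCB-style selection: form a UCB vector per arm and play uniformly among arms whose UCB vector is not dominated. The sketch below follows that classic recipe, not this paper's specific algorithm, which the snippet does not spell out.

```python
import numpy as np

def pareto_ucb_arm(means, counts, t, rng):
    """Play uniformly among arms whose d-dimensional UCB vector is not
    Pareto-dominated by any other arm's UCB vector.

    means: (K, d) empirical reward vectors; counts: (K,) pull counts.
    """
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    ucb = means + bonus[:, None]
    K = len(ucb)
    front = [i for i in range(K)
             if not any(np.all(ucb[j] >= ucb[i]) and np.any(ucb[j] > ucb[i])
                        for j in range(K))]
    return int(rng.choice(front))
```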
arXiv:2510.16811v3 Announce Type: replace Abstract: We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward's parents and then applying classic ban…
arXiv:2605.05967v1 Announce Type: new Abstract: Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, wh…
arXiv:2605.06190v1 Announce Type: new Abstract: We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, reward…
arXiv:2605.06202v1 Announce Type: new Abstract: Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imp…
arXiv:2605.05102v1 Announce Type: new Abstract: We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over…
arXiv cs.LG
TIER_1·Stefana-Lucia Anita, Gabriel Turinici·
arXiv:2605.03752v1 Announce Type: new Abstract: Multi-Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal p…
arXiv:2605.03493v1 Announce Type: new Abstract: The goal of this thesis is to investigate the structural properties of certain sequential problems in order to bring the solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be rep…
Multi-Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and has served as the foundation for downstrea…
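The softmax (Boltzmann) mapping the abstract refers to converts value estimates into a sampling distribution over arms. A minimal illustration, with an arbitrary temperature and a synthetic reward standing in for the environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_values, temperature=0.1):
    """Boltzmann/softmax arm-selection probabilities from value estimates."""
    z = np.asarray(q_values) / temperature
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

# one round: sample an arm, observe a reward, update a running mean
q, n = np.zeros(5), np.zeros(5)
a = rng.choice(5, p=softmax_policy(q))
r = rng.normal(0.5, 1.0)                # synthetic stand-in for the true reward
n[a] += 1
q[a] += (r - q[a]) / n[a]
```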
The goal of this thesis is to investigate the structural properties of certain sequential problems in order to bring the solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be represented as graphs on actions. In the second par…
arXiv cs.LG
TIER_1·Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Keegan Harris, Zhiwei Steven Wu·
arXiv:2502.00204v3 Announce Type: replace Abstract: We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after wh…
arXiv:2510.22819v2 Announce Type: replace Abstract: The convergence analysis of online learning algorithms is central to machine learning theory, where the last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolutio…
arXiv cs.LG
TIER_1·Zichun Ye, Runqi Wang, Xuchuang Wang, Xutong Liu, Shuai Li, Mohammad Hajiesmaili·
arXiv:2605.00638v1 Announce Type: new Abstract: Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / superv…
arXiv:2602.14474v2 Announce Type: replace Abstract: We study the $K$-armed Multi-Armed Bandit (MAB) problem with $M$ heterogeneous data sources, each exhibiting unknown and distinct noise variances $\{\sigma_j^2\}_{j=1}^M$. The learner's objective is standard MAB regret minimization, …
arXiv cs.LG
TIER_1·Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yun-En Liu·
arXiv:2605.00488v1 Announce Type: new Abstract: In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and presen…
Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / supervised machine unlearning, leaving unlearning for …
In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that inter…
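The snippet cuts off before the algorithm, but one simple way to realize such an interpolation is to mix least-pulled-arm exploration (good for estimating every mean) with a UCB step (good for reward), controlled by a single mixing parameter. The sketch below is that generic mixture, not the paper's method; `explore_frac` is an illustrative knob.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_arm(means, counts, t, explore_frac=0.2):
    """Interpolate between uniform estimation and reward maximization.

    explore_frac=1 approaches round-robin estimation of all arm means;
    explore_frac=0 recovers plain UCB.
    """
    if rng.random() < explore_frac:
        return int(np.argmin(counts))   # pull the least-explored arm
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(means + bonus))
```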
arXiv:2508.08420v3 Announce Type: replace-cross Abstract: We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is oft…
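A common way to use such offline data is to warm-start the ridge-regression statistics that LinUCB maintains, so the online phase begins from the pooled estimate. A minimal sketch under that reading (the paper's precise algorithm is not shown in the snippet):

```python
import numpy as np

def warm_start(offline_X, offline_y, lam=1.0):
    """Seed ridge-regression statistics from offline observations."""
    d = offline_X.shape[1]
    A = lam * np.eye(d) + offline_X.T @ offline_X   # regularized Gram matrix
    b = offline_X.T @ offline_y                     # response vector
    return A, b

def linucb_arm(arms, A, b, alpha=1.0):
    """Standard LinUCB index over candidate arm feature vectors."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    return int(np.argmax([x @ theta + alpha * np.sqrt(x @ A_inv @ x)
                          for x in arms]))
```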
arXiv:2605.07115v1 Announce Type: cross Abstract: Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-t…
arXiv stat.ML
TIER_1·Ishank Juneja, Carlee Joe-Wong, Osman Yağan·
arXiv:2605.07171v1 Announce Type: cross Abstract: The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications the goal is often to minimize cost subject to a constraint on the mi…
We study the single-index bandit problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particul…
Kullback-Leibler (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of reverse or forward KL. Recent studies have demonstrated $\varepsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the sta…
The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications the goal is often to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by…
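The snippet does not give the algorithm, but a natural optimistic rule for this objective is: deem an arm plausibly feasible if its reward UCB clears the floor, then choose the cheapest such arm by cost LCB. The sketch below implements that generic rule; `reward_floor` and the confidence radius are illustrative, not the paper's.

```python
import numpy as np

def min_cost_feasible_arm(r_mean, c_mean, counts, t, reward_floor):
    """Optimistic rule for 'minimize cost subject to reward >= floor'.

    An arm is plausibly feasible if its reward UCB clears the floor;
    among those, pick the cheapest arm by its cost LCB.
    """
    rad = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    feasible = np.where(r_mean + rad >= reward_floor)[0]
    if len(feasible) == 0:                  # nothing looks feasible yet
        return int(np.argmax(r_mean + rad)) # explore by reward UCB
    return int(feasible[np.argmin((c_mean - rad)[feasible])])
```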
Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1…
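The exact index is cut off, but a standard optimistic treatment of an upper-tail target scores each arm by an empirical upper quantile inflated through a DKW-style confidence band on the CDF. A hedged sketch along those lines (not necessarily the paper's estimator):

```python
import numpy as np

def tail_ucb_arm(samples_per_arm, t, alpha=0.1):
    """Score each arm by an optimistic empirical upper quantile.

    samples_per_arm: list of reward-sample arrays, one per arm. The
    (1 - alpha)-quantile level is shifted up by a DKW-style radius on
    the empirical CDF; unpulled arms get +inf to force exploration.
    """
    scores = []
    for obs in samples_per_arm:
        if len(obs) == 0:
            scores.append(np.inf)
            continue
        eps = np.sqrt(np.log(t + 1) / (2 * len(obs)))  # DKW-style radius
        q = 1.0 - max(alpha - eps, 0.0)                # optimistic level
        scores.append(np.quantile(obs, min(q, 1.0)))
    return int(np.argmax(scores))
```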
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently …
Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dim…
We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby cha…
arXiv:2605.02141v1 Announce Type: cross Abstract: Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Ne…
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regu…
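For reference, the reverse-KL-regularized objective these papers build on has a well-known closed form: penalizing expected reward by KL to a reference policy yields a Gibbs/softmax tilt of that reference. In generic notation (not taken from the paper):

```latex
% Reverse-KL-regularized objective and its closed-form maximizer
% (standard result; beta > 0 is the regularization strength).
\max_{\pi}\; \mathbb{E}_{a \sim \pi}[r(a)]
  - \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\qquad\Longrightarrow\qquad
\pi^{*}(a) \;\propto\; \pi_{\mathrm{ref}}(a)\, \exp\!\big(r(a)/\beta\big).
```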