New bandit algorithms tackle adversarial attacks and complex applications
ByPulseAugur Editorial·[33 sources]·
Researchers are exploring new frontiers in bandit algorithms, focusing on their application and robustness in complex scenarios. One paper investigates adversarial attacks on high-dimensional offline bandits, revealing vulnerabilities in reward models used for evaluating generative AI. Other research delves into theoretical advancements, such as variance-sensitive Thompson sampling, finite-time regret analysis for retry-aware bandits, and improved algorithms for adversarial linear contextual bandits. Additionally, studies are examining bandit applications in latent-state environments, dueling bandits with delayed feedback, and even deep brain stimulation, highlighting the algorithm's versatility.
AI
IMPACT
Advances in bandit algorithms enhance evaluation of generative models and open new avenues for AI applications in healthcare and recommendation systems.
RANK_REASON
Multiple arXiv papers detailing theoretical advancements and applications of bandit algorithms.
arXiv:2602.16965v2 Announce Type: replace Abstract: We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximize…
arXiv cs.LG
TIER_1English(EN)·Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor·
arXiv:2601.09236v3 Announce Type: replace Abstract: Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manuall…
arXiv:2606.06486v1 Announce Type: new Abstract: In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents who can respond based on histories of play. The standard metric of \emph{external regret} in online learning is known to fail to capture su…
In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents who can respond based on histories of play. The standard metric of \emph{external regret} in online learning is known to fail to capture such adaptivity. To account for players' counterfa…
arXiv cs.AI
TIER_1English(EN)·Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah·
arXiv:2602.01658v2 Announce Type: replace-cross Abstract: Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without e…
Repeated policy regret provides a game-theoretic framework for analyzing adaptive opponents in repeated games, offering stronger equilibrium guarantees than traditional external regret through novel non-convex optimization algorithms.
arXiv cs.LG
TIER_1English(EN)·Katherine Avery, Chinmay Pendse, David Jensen·
arXiv:2508.02812v3 Announce Type: replace Abstract: Causal graphical models can encode large amounts structural knowledge, both from the background knowledge of domain experts and the structural knowledge discovered from randomized experiments or observational data. However, thou…
arXiv cs.LG
TIER_1English(EN)·Tom Perneczky, Marc Abeille, David Janz·
arXiv:2606.00431v1 Announce Type: new Abstract: We prove a variance-sensitive regret bound for Thompson sampling in stochastic generalised linear bandits. The argument assumes a warm-up, after which the regret is controlled through using the Gaussian Poincar\'e inequality. This b…
arXiv cs.LG
TIER_1English(EN)·Tim van Erven, Jack Mayo, Julia Olkhovskaya, Chen-Yu Wei·
arXiv:2508.11931v3 Announce Type: replace Abstract: We present an oracle-efficient, near-optimal algorithm for linear contextual bandits with adversarial losses and stochastic action sets, only requiring a linear optimization oracle for the action sets in each round. Our approach…
arXiv:2602.05139v3 Announce Type: replace Abstract: We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algor…
arXiv cs.LG
TIER_1English(EN)·Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas·
arXiv:2605.20854v2 Announce Type: replace Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distributi…
arXiv:2605.31034v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those…
arXiv:2605.10299v2 Announce Type: replace Abstract: This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each roun…
arXiv cs.LG
TIER_1English(EN)·Arkaprava Gupta, Nicholas Carter, William Zellers, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Parasara Sridhar Duggirala, Samarjit Chakraborty·
arXiv:2601.12699v2 Announce Type: replace Abstract: Deep Brain Stimulation (DBS) is an effective treatment for Parkinson's disease, but conventional fixed-parameter stimulation can reduce battery life and cause side effects while failing to adapt to changing neural dynamics. Rece…
arXiv:2605.26554v1 Announce Type: cross Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption…
arXiv cs.LG
TIER_1English(EN)·Yu-Jie Zhang, Hao Qiu, Jonathan Scarlett, Kevin Jamieson·
arXiv:2605.26585v1 Announce Type: new Abstract: We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regulari…
arXiv:2605.24803v1 Announce Type: new Abstract: A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the …
arXiv:2601.23164v2 Announce Type: replace Abstract: We study the stochastic linear bandits with parameter noise model, in which the reward of action $a$ is $a^\top \theta$ where $\theta$ is sampled i.i.d. We show a regret upper bound of $\widetilde{O} (\sqrt{d T \log (K/\delta) \…
arXiv stat.ML
TIER_1English(EN)·Foo Hui-Mean, Yuan-chin I Chang·
arXiv:2603.21180v4 Announce Type: replace-cross Abstract: Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from eac…
arXiv:2606.04305v1 Announce Type: cross Abstract: We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured envir…
arXiv stat.ML
TIER_1English(EN)·Marc Abeille, David Janz, Ciara Pike-Burke·
arXiv:2502.08870v2 Announce Type: replace-cross Abstract: We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear…
arXiv stat.ML
TIER_1English(EN)·Youngmin Oh, Jinje Park, Taejin Paik·
arXiv:2506.01250v3 Announce Type: replace-cross Abstract: We introduce the first variance-aware algorithms for contextual dueling bandits that leverage shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the abse…
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that…
arXiv stat.ML
TIER_1English(EN)·Samya Praharaj, Chih-Yu Chang, Koulik Khamaru, Kelly W. Zhang·
arXiv:2606.00913v1 Announce Type: new Abstract: Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natur…
arXiv:2604.08149v2 Announce Type: replace-cross Abstract: We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al. (2022), in which rewards are linear functions of …
arXiv:2606.00984v1 Announce Type: new Abstract: We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting act…
arXiv:2603.28201v2 Announce Type: replace-cross Abstract: We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approa…
arXiv stat.ML
TIER_1English(EN)·Ivan Lau, Daniel McMorrow, Kevin Jamieson, Jonathan Scarlett·
arXiv:2605.30976v1 Announce Type: new Abstract: We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pull…
We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a pr…
Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natural question is whether one can construct a confi…
We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the correspondi…
arXiv:2505.02069v2 Announce Type: replace-cross Abstract: We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencie…
arXiv stat.ML
TIER_1English(EN)·Chaiwon Kim, Jongyeong Lee, Min-hwan Oh·
arXiv:2510.12152v2 Announce Type: replace Abstract: We study the decoupled multi-armed bandit problem, where the learner separately selects one arm for exploration and one, possibly different, arm for exploitation at each round. In this setting, the loss of the explored arm is ob…