New bandit algorithms tackle replicability, kernel complexity, and streaming data
By PulseAugur Editorial · [14 sources]
Several new arXiv papers advance the theory of multi-armed bandit algorithms. One introduces replicable UCB-based exploration methods for stochastic and linear bandits, with improved regret guarantees. Another places Gaussian-process UCB and decision-estimation-coefficient methods for kernel bandits in a common framework, highlighting the distinction between algorithmic information and minimax complexity. Others address sliding-window streaming bandits with limited memory and contextual queueing bandits, achieving improved regret rates and characterizing minimax dependencies.
AI
IMPACT
Advances in bandit algorithms can lead to more efficient online learning systems for recommendation engines, resource allocation, and experimentation platforms.
RANK_REASON
Multiple academic papers published on arXiv detailing new algorithms and theoretical analyses in the field of multi-armed bandits.
arXiv:2606.11192v1 Announce Type: new Abstract: We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. For the associated belief-state model, we develop a partial conservation laws (PCL)-b…
arXiv cs.LG
TIER_1 · English (EN) · Rohan Deb, Udaya Ghai, Karan Singh, Arindam Banerjee
arXiv:2604.20024v2 Announce Type: replace Abstract: We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $\rho$-replicable if two executions using shared internal rand…
arXiv:2606.11171v1 Announce Type: new Abstract: Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods may appear, at first sight, to belong to different theories. This paper places the two viewpoints in a common algorithmic-information language for frequentist RKHS bandits. GP-UCB f…
arXiv:2606.09668v1 Announce Type: new Abstract: Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected diffe…
arXiv:2606.08977v1 Announce Type: new Abstract: Motivated by the recency effect in online learning, we study algorithms for single-pass *sliding-window streaming multi-armed bandits (MABs)* in this paper. In this setting, we are given $n$ arms with unknown sub-Gaussian reward dis…
arXiv:2606.11968v1 Announce Type: cross Abstract: This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret …
arXiv:2606.09802v1 Announce Type: cross Abstract: We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of users, each having its personalized preference vector, and in the presence of context distributions that are drifting over time. Under pract…
arXiv:2606.09002v1 Announce Type: new Abstract: We study a stochastic multi-armed bandit problem in which the set of available arms expands over time. This setting arises in sequential experimentation when new actions or treatments become available during an ongoing study, making regret against a single best arm in hindsight i…
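Several of the papers listed above build on UCB (Upper Confidence Bound) exploration. For readers new to the idea, the following is a minimal sketch of the classical textbook UCB1 rule for rewards in [0, 1]. It is not the replicable, kernel, or sliding-window variant from any of these papers, and the names `ucb1`, `pull`, and `demo`, along with the Bernoulli arm probabilities, are assumptions chosen only for illustration.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Textbook UCB1. `pull(arm)` is a hypothetical callback returning a reward in [0, 1]."""
    counts = [0] * n_arms    # number of times each arm was played
    means = [0.0] * n_arms   # running empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: play every arm once so each count is nonzero.
            arm = t - 1
        else:
            # Pick the arm with the largest optimistic index:
            # empirical mean plus a confidence bonus that shrinks with plays.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def demo(seed=0):
    """Run UCB1 on three illustrative Bernoulli arms (probabilities are made up)."""
    rng = random.Random(seed)
    probs = [0.2, 0.5, 0.8]
    return ucb1(lambda a: 1.0 if rng.random() < probs[a] else 0.0, len(probs), 5000)


if __name__ == "__main__":
    counts, means = demo()
    print("plays per arm:", counts)
    print("estimated means:", [round(m, 3) for m in means])
```

The confidence bonus makes rarely played arms look optimistic, so the rule keeps sampling them until the data rule them out. In this toy setup most plays should concentrate on the arm with the highest success probability.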