arXiv:2605.22191v1 Announce Type: new Abstract: Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether op…
We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum rew…
We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, whi…
Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an exis…
arXiv:2511.10619v2 Announce Type: replace-cross Abstract: The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperpara…
arXiv:2605.20269v1 Announce Type: cross Abstract: Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits…
arXiv:2503.00565v3 Announce Type: replace Abstract: The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications,…
arXiv stat.ML
TIER_1English(EN)·Tong Li, Thiago de Queiroz Casanova, Eric M. Schwartz, Victor Kostyuk, Dehan Kong, Joseph J. Williams·
arXiv:2603.11276v2 Announce Type: replace Abstract: Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--…
Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non…