PulseAugur

New Research Advances Contextual Bandit Algorithms for Dynamic, Complex Environments

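Researchers are exploring advanced techniques for contextual bandit problems, focusing on tighter regret bounds and on handling dynamic environments. One paper introduces a retry-aware bandit algorithm that optimizes the best outcome across multiple attempts and proves the first sublinear regret bound for that objective. Another study proposes active context selection to improve simple regret in contextual bandits and shows significant improvement over passive sampling. In addition, a new method called PONA is proposed for offline contextual bandits; by exploiting action features it can learn about and select new actions, and it outperforms existing methods that are restricted to a predefined action set. Finally, a method called RIE-Greedy uses regularization-induced exploration in contextual bandits, with a proof of theoretical equivalence to Thompson Sampling and a demonstration of practical effectiveness.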
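To make the terms "regret" and "Thompson Sampling" concrete, the following is a minimal sketch of textbook Beta-Bernoulli Thompson Sampling on a plain (non-contextual) multi-armed bandit. It is not the algorithm of any paper listed here; the function name `thompson_sampling`, the arm means, and the horizon are illustrative assumptions. It tracks cumulative pseudo-regret, the summed gap between the best arm's mean and the mean of the arm actually pulled.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Textbook Beta-Bernoulli Thompson Sampling (illustrative sketch only).

    true_means: hypothetical Bernoulli success probabilities, one per arm.
    Returns (pulls per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior on each arm, i.e. uniform over [0, 1].
    alpha = [1] * n_arms
    beta = [1] * n_arms
    pulls = [0] * n_arms
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one plausible mean per arm from its posterior ...
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # ... and play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulate a Bernoulli reward from the (unknown to the learner) mean.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Conjugate posterior update for the played arm.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

        # Pseudo-regret: gap between the best mean and the chosen arm's mean.
        regret += best_mean - true_means[arm]

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling([0.1, 0.5, 0.9], horizon=2000)
    print("pulls per arm:", pulls)
    print("cumulative pseudo-regret:", round(regret, 1))
```

A policy that picks arms uniformly at random would accumulate regret linearly in the horizon (here about 0.4 per round on average), whereas a sublinear regret bound means the per-round regret shrinks toward zero as the learner concentrates on the best arm.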

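Impact: These papers introduce new algorithms and theoretical analyses for contextual bandit problems, which could improve decision-making in recommender systems and other applications.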

Ranking rationale: This cluster contains several academic papers on theoretical advances in bandit algorithms.


AI-generated summary · Google Gemini · from 9 sources.

Sources [9]

  1. arXiv cs.LG TIER_1 English(EN) · Shuche Wang, Adarsh Barik, Vincent Y. F. Tan ·

    Bandit Convex Optimization with Gradient Prediction Adaptivity

    arXiv:2605.22191v1 Announce Type: new Abstract: Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether op…

  2. arXiv cs.LG TIER_1 English(EN) · Paavo Parmas ·

    Finite-Time Regret Analysis of Retry-Aware Bandits

    We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum rew…

  3. arXiv cs.LG TIER_1 English(EN) · Negar Kiyavash ·

    Active Context Selection Improves Simple Regret in Contextual Bandits

    We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, whi…

  4. arXiv cs.LG TIER_1 English(EN) · Yuta Saito ·

    Offline Contextual Bandits in the Presence of New Actions

    Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an exis…

  5. arXiv stat.ML TIER_1 English(EN) · Avrim Blum, Marten Garicano, Kavya Ravichandran, Dravyansh Sharma ·

    Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem

    arXiv:2511.10619v2 Announce Type: replace-cross Abstract: The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperpara…

  6. arXiv stat.ML TIER_1 English(EN) · Hamed Khosravi, Xiaoming Huo ·

    Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

    arXiv:2605.20269v1 Announce Type: cross Abstract: Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits…

  7. arXiv stat.ML TIER_1 English(EN) · Sakshi Arya, Hyebin Song ·

    Batched Single-Index Global Multi-Armed Bandits with Covariates

    arXiv:2503.00565v3 Announce Type: replace Abstract: The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications,…

  8. arXiv stat.ML TIER_1 English(EN) · Tong Li, Thiago de Queiroz Casanova, Eric M. Schwartz, Victor Kostyuk, Dehan Kong, Joseph J. Williams ·

    RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

    arXiv:2603.11276v2 Announce Type: replace Abstract: Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--…

  9. arXiv stat.ML TIER_1 English(EN) · Xiaoming Huo ·

    Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

    Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non…