English(EN) The Multi-Armed Bandit Problem and Its Solutions

Researchers Advance Bayesian Optimization for Efficient Decision-Making and Hyperparameter Tuning

By the PulseAugur Editorial Team · [33 sources] · 2018-01-23 00:00

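Several recent arXiv papers explore advances in the multi-armed bandit problem, a framework for sequential decision-making under uncertainty. The research includes handling changing action availability in "Flickering Multi-Armed Bandits" and improving regret bounds for logistic bandits without strict context diversity assumptions. Other work focuses on geometry-aware offline-to-online learning, spectral bandits for smooth functions on graphs, and privacy-preserving algorithms for generalized linear contextual bandits.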

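Impact: Advances in bandit algorithms may lead to more efficient online learning systems and improve decision-making in recommendation, advertising, and resource allocation.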

Ranking rationale: Multiple arXiv papers report various theoretical advances in multi-armed bandit algorithms.

Read on Lil'Log (Lilian Weng) →

AI-generated summary · Google Gemini · from 33 sources. How we write summaries →

Sources [33]

Lil'Log (Lilian Weng) TIER_1 English(EN) · 2018-01-23 00:00

The Multi-Armed Bandit Problem and Its Solutions

The algorithms are implemented for Bernoulli bandit in …
arXiv cs.LG TIER_1 English(EN) · Maresa Schröder, Pascal Janetzky, Michael Klar, Stefan Feuerriegel · 2026-05-08 04:00

ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization

arXiv:2605.06454v1 Announce Type: new Abstract: Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mo…
arXiv cs.LG TIER_1 English(EN) · Becky Langdon, Gabriel D. Patrón, Chrysoula D. Kappatou, Robert M. Lee, Behrang Shafei, Jixiang Qing, Ruth Misener, Mark van der Wilk, Calvin Tsay · 2026-05-08 04:00

Meta-Learning for Sample-Efficient Bayesian Optimisation of Fed-Batch Processes

arXiv:2605.05382v1 Announce Type: cross Abstract: The optimisation of fed-batch (bio)chemical process recipes is subject to inherent, underlying, and unmeasurable fluctuations across batches, whose trajectories are difficult to model and costly to measure. Bayesian Optimisation (…
arXiv cs.AI TIER_1 English(EN) · Stefan Feuerriegel · 2026-05-07 15:49

ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization

Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the…
arXiv cs.LG TIER_1 English(EN) · Noel Thomas · 2026-05-07 04:00

Regime-Conditional Evaluation in Multi-Context Bayesian Optimization

arXiv:2605.04895v1 Announce Type: new Abstract: Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and met…
arXiv cs.LG TIER_1 English(EN) · Sofianos Panagiotis Fotias, Vassilis Gaganis · 2026-05-05 04:00

Introducing Permutation-Invariant Priors in Bayesian Optimization for Carbon Capture and Storage Applications

arXiv:2605.02409v1 Announce Type: new Abstract: Bayesian Optimization is an iterative method, tailored to optimizing expensive black box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inp…
arXiv cs.LG TIER_1 English(EN) · Sofianos Panagiotis Fotias, Vassilis Gaganis · 2026-05-05 04:00

Bayesian Neural Network Surrogates for Bayesian Optimization of Carbon Capture and Storage Operations

arXiv:2507.21803v2 Announce Type: replace Abstract: Carbon Capture and Storage (CCS) stands as a pivotal technology for fostering a sustainable future. The process, which involves injecting supercritical CO$_2$ into underground formations, a method already widely used for Enhance…
arXiv cs.LG TIER_1 English(EN) · Vassilis Gaganis · 2026-05-04 09:55

Introducing Permutation-Invariant Priors in Bayesian Optimization for Carbon Capture and Storage Applications

Bayesian Optimization is an iterative method, tailored to optimizing expensive black box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inputs with permutation symmetries, as the most com…
arXiv cs.LG TIER_1 English(EN) · Jesse Schneider, William J. Welch · 2026-05-04 04:00

Bayesian Optimization in Linear Time

arXiv:2605.00237v1 Announce Type: new Abstract: Bayesian optimization is a sequential method for minimizing objective functions that are expensive to evaluate and about which few assumptions can be made. By using all gathered data to train a Gaussian process model for the functio…
arXiv cs.AI TIER_1 Deutsch(DE) · Ziwei Zhang, Jonathan Yu-Meng Li · 2026-05-01 04:00

Sampler-Robust Optimization under Generative Models

arXiv:2604.27447v1 Announce Type: cross Abstract: Modern stochastic optimization pipelines increasingly rely on learned generative models to represent uncertainty, while downstream decisions are evaluated almost entirely through Monte Carlo scenarios. This shifts the operational …
Hugging Face Daily Papers TIER_1 Deutsch(DE) · 2026-04-30 05:33

Sampler-Robust Optimization under Generative Models

Modern stochastic optimization pipelines increasingly rely on learned generative models to represent uncertainty, while downstream decisions are evaluated almost entirely through Monte Carlo scenarios. This shifts the operational object of uncertainty from an explicit probability…
arXiv cs.LG TIER_1 English(EN) · Zean Han, Ruihan Lin, Zezhen Ding, Jiheng Zhang · 2026-04-28 04:00

Geometry-Aware Offline-to-Online Learning in Linear Contextual Bandits

arXiv:2604.24016v1 Announce Type: new Abstract: We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional…
arXiv cs.LG TIER_1 English(EN) · Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen · 2026-04-28 04:00

Flickering Multi-Armed Bandits

arXiv:2602.17315v2 Announce Type: replace Abstract: We introduce Flickering Multi-Armed Bandits (FMAB) to model sequential decision-making in environments with changing action availability, where accessibility of the next action is restricted to a subset dependent on the agent's …
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-27 22:05

Finite-Time Analysis of Thompson Sampling for Bayesian Optimization with Preference Feedback

Preference feedback, in the form of pairwise comparisons rather than scalar scores, has seen increasing use in applications such as human-, laboratory-, and expert-in-the-loop design, as well as scientific discovery. We propose a Thompson Sampling (TS) approach to Bayesian optimi…
arXiv cs.LG TIER_1 English(EN) · Seoungbin Bae, Dabeen Lee · 2026-04-27 04:00

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

arXiv:2604.22161v1 Announce Type: new Abstract: We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound r…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-27 03:53

Geometry-Aware Offline-to-Online Learning in Linear Contextual Bandits

We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional transfer with a shift certificate $(M_{\mathrm{…
arXiv cs.LG TIER_1 English(EN) · Dabeen Lee · 2026-04-24 02:21

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, su…
arXiv stat.ML TIER_1 English(EN) · Noel Thomas · 2026-05-06 13:27

Regime-Conditional Evaluation in Multi-Context Bayesian Optimization

Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from Neur…
arXiv stat.ML TIER_1 English(EN) · Abhirami Pillai · 2026-04-30 04:00

Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making

arXiv:2604.26169v1 Announce Type: cross Abstract: Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offl…
arXiv stat.ML TIER_1 English(EN) · Gergely Neu, Michal Valko · 2026-04-29 04:00

Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses

arXiv:2604.25269v1 Announce Type: cross Abstract: Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments…
arXiv stat.ML TIER_1 English(EN) · Tomáš Kocák, Rémi Munos, Branislav Kveton, Shipra Agrawal, Michal Valko · 2026-04-29 04:00

Spectral Bandits

arXiv:2604.25272v1 Announce Type: new Abstract: Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learn…
arXiv stat.ML TIER_1 English(EN) · Abhirami Pillai · 2026-04-28 23:24

Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making

Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline - first collect historical data to es…
arXiv stat.ML TIER_1 English(EN) · Michal Valko · 2026-04-28 06:29

Spectral Bandits

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as conten…
arXiv stat.ML TIER_1 English(EN) · Michal Valko · 2026-04-28 06:28

Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses

Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stoc…
arXiv stat.ML TIER_1 English(EN) · Tomas Kocak, Gergely Neu, Michal Valko, Remi Munos · 2026-04-28 04:00

Efficient Learning by Implicit Exploration in Bandit Problems with Side Observations

arXiv:2604.24555v1 Announce Type: cross Abstract: We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that i…
arXiv stat.ML TIER_1 English(EN) · Sahasrajit Sarmasarkar · 2026-04-28 04:00

Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits

arXiv:2602.00417v2 Announce Type: replace Abstract: We present the first algorithms for generalized linear contextual bandits under shuffle differential privacy and joint differential privacy. While prior work on private contextual bandits has been restricted to linear reward mod…
arXiv stat.ML TIER_1 English(EN) · Remi Munos · 2026-04-27 14:45

Efficient Learning by Implicit Exploration in Bandit Problems with Side Observations

We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets …
arXiv stat.ML TIER_1 English(EN) · Jierui Zuo, Hanzhang Qin · 2026-04-27 04:00

On Pareto Optimality for Parametric Choice Bandits

arXiv:2501.19277v4 Announce Type: replace Abstract: We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-explorati…
arXiv stat.ML TIER_1 English(EN) · Matías Carrasco, Alejandro Cholaquidis · 2026-04-27 04:00

Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

arXiv:2604.22140v1 Announce Type: new Abstract: We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon …
arXiv stat.ML TIER_1 English(EN) · Alejandro Cholaquidis · 2026-04-24 01:13

Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mi…
arXiv stat.ML TIER_1 English(EN) · Alejandro Cholaquidis · 2026-04-24 01:13

Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mi…
arXiv stat.ML TIER_1 English(EN) · Michal Valko · 2026-04-23 08:48

A Single Algorithm for Both Restless and Rested Rotting Bandits

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are re…
Eugene Yan TIER_1 English(EN) · 2022-05-08 00:00

Bandits for Recommender Systems

Industry examples, exploration strategies, warm-starting, off-policy evaluation, and more.
