Regret Minimization with Adaptive Opponents in Repeated Games

New Bandit Algorithms Tackle Adversarial Attacks and Complex Applications

By the PulseAugur editorial team · [33 sources] · 2026-05-26 04:00

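Researchers are exploring new frontiers for bandit algorithms, focusing on their application and robustness in complex settings. One paper studies adversarial attacks on high-dimensional offline bandits, exposing vulnerabilities in the reward models used to evaluate generative AI. Other work advances the theory, including variance-sensitive Thompson sampling, a finite-time regret analysis of retry-aware bandits, and an improved algorithm for adversarial linear contextual bandits. Further studies examine bandits in latent-state environments, dueling bandits with delayed feedback, and even deep brain stimulation, highlighting the versatility of these algorithms.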

Impact: Advances in bandit algorithms strengthen the evaluation of generative models and open new avenues for AI in healthcare and recommender systems.

Why it ranks: Multiple arXiv papers detail theoretical advances in, and applications of, bandit algorithms.


AI-generated summary · Google Gemini · from 33 sources.

Sources [33]

arXiv cs.LG TIER_1 Deutsch(DE) · Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen · 2026-06-05 04:00

Multi-Agent Lipschitz Bandits

arXiv:2602.16965v2 Announce Type: replace Abstract: We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximize…
arXiv cs.LG TIER_1 English(EN) · Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor · 2026-06-05 04:00

Reward Learning through Ranking Mean Squared Error

arXiv:2601.09236v3 Announce Type: replace Abstract: Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manuall…
arXiv cs.LG TIER_1 English(EN) · Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang · 2026-06-05 04:00

Regret Minimization with Adaptive Opponents in Repeated Games

arXiv:2606.06486v1 Announce Type: new Abstract: In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture su…
arXiv cs.AI TIER_1 English(EN) · Kaiqing Zhang · 2026-06-04 17:59

Regret Minimization with Adaptive Opponents in Repeated Games

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfa…
arXiv cs.AI TIER_1 English(EN) · Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah · 2026-06-04 04:00

Effective Adversarial Attacks on High-Dimensional Offline Bandits

arXiv:2602.01658v2 Announce Type: replace-cross Abstract: Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without e…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Regret Minimization with Adaptive Opponents in Repeated Games

Repeated policy regret provides a game-theoretic framework for analyzing adaptive opponents in repeated games, offering stronger equilibrium guarantees than traditional external regret through novel non-convex optimization algorithms.
arXiv cs.LG TIER_1 English(EN) · Katherine Avery, Chinmay Pendse, David Jensen · 2026-06-02 04:00

Evaluating and Learning Robust Bandit Policies under Uncertain Causal Mechanisms

arXiv:2508.02812v3 Announce Type: replace Abstract: Causal graphical models can encode large amounts of structural knowledge, both from the background knowledge of domain experts and the structural knowledge discovered from randomized experiments or observational data. However, thou…
arXiv cs.LG TIER_1 English(EN) · Tom Perneczky, Marc Abeille, David Janz · 2026-06-02 04:00

Variance-Sensitive Thompson Sampling for Generalised Linear Bandits, Revisited

arXiv:2606.00431v1 Announce Type: new Abstract: We prove a variance-sensitive regret bound for Thompson sampling in stochastic generalised linear bandits. The argument assumes a warm-up, after which the regret is controlled using the Gaussian Poincaré inequality. This b…
arXiv cs.LG TIER_1 English(EN) · Tim van Erven, Jack Mayo, Julia Olkhovskaya, Chen-Yu Wei · 2026-06-02 04:00

An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

arXiv:2508.11931v3 Announce Type: replace Abstract: We present an oracle-efficient, near-optimal algorithm for linear contextual bandits with adversarial losses and stochastic action sets, only requiring a linear optimization oracle for the action sets in each round. Our approach…
arXiv cs.LG TIER_1 English(EN) · Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang · 2026-06-02 04:00

Adaptive Exploration for Latent-State Bandits

arXiv:2602.05139v3 Announce Type: replace Abstract: We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algor…
arXiv cs.LG TIER_1 English(EN) · Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas · 2026-06-02 04:00

Finite-Time Regret Analysis of Retry-Aware Bandits

arXiv:2605.20854v2 Announce Type: replace Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distributi…
arXiv cs.AI TIER_1 English(EN) · William Overman, Mohsen Bayati · 2026-06-01 04:00

Annealed Greedy Softmax in Multi-Armed Bayesian Bandits

arXiv:2605.31034v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those…
arXiv cs.LG TIER_1 English(EN) · Shogo Iwazaki · 2026-05-29 04:00

Near-Optimal Algorithms for Adversarial Kernelized Bandits

arXiv:2605.10299v2 Announce Type: replace Abstract: This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each roun…
arXiv cs.LG TIER_1 English(EN) · Arkaprava Gupta, Nicholas Carter, William Zellers, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Parasara Sridhar Duggirala, Samarjit Chakraborty · 2026-05-29 04:00

Bandit Algorithms for Deep Brain Stimulation

arXiv:2601.12699v2 Announce Type: replace Abstract: Deep Brain Stimulation (DBS) is an effective treatment for Parkinson's disease, but conventional fixed-parameter stimulation can reduce battery life and cause side effects while failing to adapt to changing neural dynamics. Rece…
arXiv cs.AI TIER_1 English(EN) · Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai · 2026-05-27 04:00

Linear and Neural Dueling Bandits with Delayed Feedback

arXiv:2605.26554v1 Announce Type: cross Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption…
arXiv cs.LG TIER_1 English(EN) · Yu-Jie Zhang, Hao Qiu, Jonathan Scarlett, Kevin Jamieson · 2026-05-27 04:00

Near-Optimal Regret in Adversarial Kernel Bandits

arXiv:2605.26585v1 Announce Type: new Abstract: We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regulari…
arXiv cs.LG TIER_1 English(EN) · Emma Brunskill, Ishani Karmarkar, Zhaoqi Li · 2026-05-26 04:00

Active Learning for Stochastic Contextual Linear Bandits

arXiv:2605.24803v1 Announce Type: new Abstract: A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the …
arXiv cs.LG TIER_1 English(EN) · Daniel Ezer, Alon Peled-Cohen, Yishay Mansour · 2026-05-26 04:00

Stochastic Linear Bandits with Parameter Noise

arXiv:2601.23164v2 Announce Type: replace Abstract: We study the stochastic linear bandits with parameter noise model, in which the reward of action $a$ is $a^\top \theta$ where $\theta$ is sampled i.i.d. We show a regret upper bound of $\widetilde{O} (\sqrt{d T \log (K/\delta) \…
arXiv stat.ML TIER_1 English(EN) · Foo Hui-Mean, Yuan-chin I Chang · 2026-06-04 04:00

ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization

arXiv:2603.21180v4 Announce Type: replace-cross Abstract: Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from eac…
arXiv stat.ML TIER_1 English(EN) · Kushagra Chandak, Toshinori Kitamura, Xiaoqi Tan · 2026-06-04 04:00

Offline-to-Online Learning in Linear Bandits

arXiv:2606.04305v1 Announce Type: cross Abstract: We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured envir…
arXiv stat.ML TIER_1 English(EN) · Marc Abeille, David Janz, Ciara Pike-Burke · 2026-06-04 04:00

When and Why Randomised Exploration Works (in Linear Bandits)

arXiv:2502.08870v2 Announce Type: replace-cross Abstract: We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear…
arXiv stat.ML TIER_1 English(EN) · Youngmin Oh, Jinje Park, Taejin Paik · 2026-06-04 04:00

Neural Variance-Aware Dueling Bandits with Deep Representation and Shallow Exploration

arXiv:2506.01250v3 Announce Type: replace-cross Abstract: We introduce the first variance-aware algorithms for contextual dueling bandits that leverage shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the abse…
arXiv stat.ML TIER_1 English(EN) · Xiaoqi Tan · 2026-06-03 00:18

Offline-to-Online Learning in Linear Bandits

We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that…
arXiv stat.ML TIER_1 English(EN) · Samya Praharaj, Chih-Yu Chang, Koulik Khamaru, Kelly W. Zhang · 2026-06-02 04:00

Bandit Simulation for Average Reward Inference

arXiv:2606.00913v1 Announce Type: new Abstract: Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natur…
arXiv stat.ML TIER_1 English(EN) · Zhen Li (LMO, CELESTE, HEC Paris), Gilles Stoltz (LMO, CELESTE, HEC Paris) · 2026-06-02 04:00

A Direct Approach to Contextual Bandits with Latent State Dynamics

arXiv:2604.08149v2 Announce Type: replace-cross Abstract: We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al. (2022), in which rewards are linear functions of …
arXiv stat.ML TIER_1 English(EN) · Sanghoon Yu, Min-hwan Oh · 2026-06-02 04:00

A Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates

arXiv:2606.00984v1 Announce Type: new Abstract: We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting act…
arXiv stat.ML TIER_1 English(EN) · Andrew Jacobsen, Dorian Baudry, Shinji Ito, Nicolò Cesa-Bianchi · 2026-06-01 04:00

A Perturbation Approach to Unconstrained Linear Bandits

arXiv:2603.28201v2 Announce Type: replace-cross Abstract: We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approa…
arXiv stat.ML TIER_1 English(EN) · Ivan Lau, Daniel McMorrow, Kevin Jamieson, Jonathan Scarlett · 2026-06-01 04:00

Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

arXiv:2605.30976v1 Announce Type: new Abstract: We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pull…
arXiv stat.ML TIER_1 English(EN) · Min-hwan Oh · 2026-05-31 03:46

A Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates

We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a pr…
arXiv stat.ML TIER_1 English(EN) · Kelly W. Zhang · 2026-05-30 22:27

Bandit Simulation for Average Reward Inference

Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natural question is whether one can construct a confi…
arXiv stat.ML TIER_1 English(EN) · Jonathan Scarlett · 2026-05-29 08:17

Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the correspondi…
arXiv stat.ML TIER_1 English(EN) · Seoungbin Bae, Dabeen Lee · 2026-05-29 04:00

Neural Logistic Bandits

arXiv:2505.02069v2 Announce Type: replace-cross Abstract: We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencie…
arXiv stat.ML TIER_1 English(EN) · Chaiwon Kim, Jongyeong Lee, Min-hwan Oh · 2026-05-29 04:00

Follow-the-Perturbed-Leader in Decoupled Bandits: Achieving Both Optimality and Practicality

arXiv:2510.12152v2 Announce Type: replace Abstract: We study the decoupled multi-armed bandit problem, where the learner separately selects one arm for exploration and one, possibly different, arm for exploitation at each round. In this setting, the loss of the explored arm is ob…
