PulseAugur

OpenAI advances reinforcement learning through Dota 2, safety, and generalization

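OpenAI has published a series of research papers detailing advances in reinforcement learning. These include OpenAI Five reaching superhuman performance in Dota 2, a benchmark for safe exploration in RL, and the CoinRun environment for quantifying generalization. The company also explored novel methods such as prediction-based rewards for curiosity-driven exploration, learning policy representations in multi-agent systems, and an experimental metalearning approach called Evolved Policy Gradients that speeds up training on new tasks. Further research addressed variance reduction for policy gradients and the equivalence between policy gradients and soft Q-learning, and introduced challenging robotics environments for multi-goal RL.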

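Impact: Demonstrates significant advances in RL capabilities, including superhuman performance, safety, generalization, and exploration, pushing the boundaries of AI.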

Ranking rationale: OpenAI published multiple research papers covering various aspects of reinforcement learning.

AI-generated summary · Google Gemini · from 870 sources.

Sources [870]

  1. OpenAI News TIER_1 English(EN) ·

    Dota 2 with Large Scale Deep Reinforcement Learning

  2. OpenAI News TIER_1 English(EN) ·

    Benchmarking Safe Exploration in Deep Reinforcement Learning

  3. OpenAI News TIER_1 English(EN) ·

    Quantifying Generalization in Reinforcement Learning

    We’re releasing CoinRun, a training environment which provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the env…

  4. OpenAI News TIER_1 English(EN) ·

    Reinforcement Learning with Prediction-Based Rewards

    We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first time exceeds average human performance on Montezuma’s Revenge.

  5. OpenAI News TIER_1 English(EN) ·

    Learning Policy Representations in Multiagent Systems

  6. OpenAI News TIER_1 English(EN) ·

    Evolved Policy Gradients

    We’re releasing an experimental metalearning approach called Evolved Policy Gradients, a method that evolves the loss function of learning agents, which can enable fast training on novel tasks. Agents trained with EPG can succeed at basic tasks at test time that were outside thei…

  7. OpenAI News TIER_1 English(EN) ·

    Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

  8. OpenAI News TIER_1 English(EN) ·

    Some Considerations on Learning to Explore via Meta-Reinforcement Learning

  9. OpenAI News TIER_1 English(EN) ·

    Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research

  10. OpenAI News TIER_1 English(EN) ·

    Equivalence Between Policy Gradients and Soft Q-Learning

  11. OpenAI News TIER_1 English(EN) ·

    Stochastic Neural Networks for Hierarchical Reinforcement Learning

  12. OpenAI News TIER_1 English(EN) ·

    #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

  13. OpenAI News TIER_1 English(EN) ·

    RL²: Fast reinforcement learning via slow reinforcement learning

  14. Apple Machine Learning Research TIER_1 English(EN) ·

    PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

    Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents using outcome-only rewards suffers from credit-assignment ambiguity, obscuring which…

  15. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Reward Hacking in Reinforcement Learning

    Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities i…

  16. Hugging Face Blog TIER_1 English(EN) ·

    Introducing ⚔️ AI vs. AI ⚔️, a deep reinforcement learning multi-agent competition system

  17. Hugging Face Blog TIER_1 English(EN) ·

    Illustrating Reinforcement Learning from Human Feedback (RLHF)

  18. Hugging Face Blog TIER_1 English(EN) ·

    An Introduction to Deep Reinforcement Learning

  19. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Exploration Strategies in Deep Reinforcement Learning

    Exploitation versus exploration is a critical topic in reinforcement learning. This post introduces several common approaches for better exploration in Deep RL. [Updated on 2020-06-17: Add “exploration…

  20. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Curriculum for Reinforcement Learning

    A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help r…

  21. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Meta Reinforcement Learning

    Meta-RL is meta-learning on reinforcement learning tasks. After trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into t…

  22. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym

    Let's see how to implement a number of classic deep reinforcement learning models in code. The full implementation is available in lilianweng/deep-reinforcement-learning-gym (https://github.com/lilianweng/deep-reinforcement-learning-gym). In the prev…

  23. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Policy Gradient Algorithms

    Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, …

  24. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    A (Long) Peek into Reinforcement Learning

    In this post, we are gonna briefly go over the field of Reinforcement Learning (RL), from fundamental concepts to classic algorithms. Hopefully, this review is helpful enough so that newbies would not get lost in specialized terms and jargons while starting. [WARNING] This i…

  25. Andrej Karpathy TIER_1 English(EN) · Andrej Karpathy ·

    Pong AI with Policy Gradients

    Trained for ~8000 episodes, each episode = ~30 games. Updates were done in batches of 10 episodes, so ~800 updates total. Policy network is a 2-layer neural net connected to raw pixels, with 200 hidden units. Trained with RMSProp and learning rate 1e-4. The final agent does not b…

  26. arXiv cs.LG TIER_1 English(EN) · Hsiao-Ru Pan, Bernhard Schölkopf ·

    Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

    arXiv:2606.20411v1 Announce Type: new Abstract: Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and…

  27. arXiv cs.AI TIER_1 English(EN) · Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack ·

    Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

    arXiv:2606.19357v1 Announce Type: cross Abstract: We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroll…

  28. arXiv cs.AI TIER_1 English(EN) · Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang ·

    MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

    arXiv:2506.14990v3 Announce Type: replace Abstract: Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 se…

  29. arXiv cs.AI TIER_1 English(EN) · ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim ·

    MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

    arXiv:2510.18383v3 Announce Type: replace-cross Abstract: Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor…

  30. arXiv cs.LG TIER_1 English(EN) · Bernhard Schölkopf ·

    Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

    Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and its requirement to model transition probabiliti…

  31. arXiv cs.AI TIER_1 English(EN) · Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen ·

    Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

    arXiv:2606.18820v1 Announce Type: cross Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoff…

  32. arXiv cs.AI TIER_1 English(EN) · Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang ·

    TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

    arXiv:2606.18308v1 Announce Type: cross Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these…

  33. arXiv cs.AI TIER_1 English(EN) · Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas ·

    Self-CTRL: Self-Consistency Training with Reinforcement Learning

    arXiv:2606.18327v1 Announce Type: cross Abstract: Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that …

  34. arXiv cs.AI TIER_1 English(EN) · Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    arXiv:2606.18810v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routi…

  35. arXiv cs.LG TIER_1 English(EN) · Yiyan Huang, Cheuk Hang Leung, Qi Wu, Zhiheng Zhang ·

    Wasserstein Policy Learning for Distributional Outcomes

    arXiv:2606.19117v1 Announce Type: cross Abstract: Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare …

  36. arXiv cs.LG TIER_1 English(EN) · Xuanfei Ren, Tengyang Xie ·

    When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    arXiv:2606.18531v1 Announce Type: cross Abstract: Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimizat…

  37. arXiv cs.CL TIER_1 English(EN) · Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    arXiv:2606.18902v1 Announce Type: new Abstract: Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (AP…

  38. arXiv cs.AI TIER_1 English(EN) · Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart ·

    UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

    arXiv:2606.19328v1 Announce Type: cross Abstract: Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suff…

  39. arXiv cs.AI TIER_1 English(EN) · Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao ·

    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

    arXiv:2606.18831v1 Announce Type: cross Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as …

  40. arXiv cs.AI TIER_1 English(EN) · Nicholas Rhinehart ·

    UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

    Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during …

  41. arXiv cs.CL TIER_1 English(EN) · Jinghong Chen ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stocha…

  42. Hugging Face Daily Papers TIER_1 English(EN) ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stocha…

  43. arXiv cs.AI TIER_1 English(EN) · Chaojun Xiao ·

    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

    Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, ye…

  44. arXiv cs.AI TIER_1 English(EN) · Xuebin Chen ·

    Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

    Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard …

  45. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning …

  46. arXiv cs.AI TIER_1 English(EN) · Heyan Huang ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning …

  47. arXiv cs.AI TIER_1 English(EN) · Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu ·

    Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

    arXiv:2606.17735v1 Announce Type: new Abstract: Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations int…

  48. arXiv cs.AI TIER_1 English(EN) · Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He ·

    Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

    arXiv:2606.17591v1 Announce Type: new Abstract: Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and in…

  49. arXiv cs.LG TIER_1 English(EN) · Andreas Athanasopoulos, Christos Dimitrakakis ·

    Learning in Matching Games with Bandit Feedback

    arXiv:2506.03802v2 Announce Type: replace Abstract: We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with init…

  50. arXiv cs.LG TIER_1 English(EN) · Steve Halley, Maurício Gruppi ·

    Deep Reinforcement Learning for Minimum Zero-Forcing Sets

    arXiv:2606.18106v1 Announce Type: new Abstract: This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem w…

  51. arXiv cs.LG TIER_1 English(EN) · Cosmin Borsa, Michael Ludkovski ·

    Continuous-time Optimal Stopping through Deep Reinforcement Learning

    arXiv:2606.17545v1 Announce Type: new Abstract: Simulation based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal e…

  52. arXiv cs.CL TIER_1 English(EN) · Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li ·

    EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

    arXiv:2606.17680v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuit…

  53. arXiv cs.CL TIER_1 English(EN) · Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    arXiv:2606.18195v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. T…

  54. arXiv cs.AI TIER_1 English(EN) · Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng ·

    When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    arXiv:2605.05172v2 Announce Type: replace-cross Abstract: Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online lea…

  55. arXiv cs.AI TIER_1 English(EN) · Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll ·

    Knowledge Reutilization in Meta-Reinforcement Learning

    arXiv:2606.18132v1 Announce Type: new Abstract: Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-param…

  56. arXiv cs.CL TIER_1 English(EN) · Shiwei Liu ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-ri…

  57. arXiv cs.AI TIER_1 English(EN) · Alois Knoll ·

    Knowledge Reutilization in Meta-Reinforcement Learning

    Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, …

  58. arXiv cs.LG TIER_1 English(EN) · Maurício Gruppi ·

    Deep Reinforcement Learning for Minimum Zero-Forcing Sets

    This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propag…

  59. arXiv cs.CL TIER_1 English(EN) · Juanzi Li ·

    EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

    Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamic…

  60. arXiv cs.LG TIER_1 English(EN) · Michael Ludkovski ·

    Continuous-time Optimal Stopping through Deep Reinforcement Learning

    Simulation based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, app…

  61. arXiv cs.LG TIER_1 English(EN) · Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal ·

    Learning Policies from a Single Trajectory in Average-Reward Markov Decision Processes

    arXiv:2606.16729v1 Announce Type: new Abstract: While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assump…

  62. arXiv cs.CL TIER_1 English(EN) · Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi ·

    Agentic Reinforcement Learning for Search Misaligns Instruction-Tuned Models

    arXiv:2510.17431v2 Announce Type: replace Abstract: Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find …

  63. arXiv cs.AI TIER_1 English(EN) · Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues ·

    When in Doubt, Plan: Committed Small Language Model Reasoning for Reactive Reinforcement Learning

    arXiv:2606.16995v1 Announce Type: new Abstract: Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with…

  64. arXiv cs.AI TIER_1 English(EN) · Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang ·

    StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

    arXiv:2606.15197v1 Announce Type: cross Abstract: Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or …

  65. arXiv cs.AI TIER_1 English(EN) · Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang ·

    On-Policy Distillation for Multi-Turn Agents with Curriculum Turn-Level Guidance

    arXiv:2606.15912v1 Announce Type: cross Abstract: Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in prac…

  66. arXiv cs.AI TIER_1 English(EN) · Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra ·

    Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

    arXiv:2606.16515v1 Announce Type: cross Abstract: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor …

  67. arXiv cs.AI TIER_1 English(EN) · Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet ·

    A Unified Causal-Origin Taxonomy of Distribution Shifts in Reinforcement Learning

    arXiv:2606.16933v1 Announce Type: cross Abstract: Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between traini…

  68. arXiv cs.AI TIER_1 English(EN) · Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause ·

    Safe Exploration via Policy Priors

    arXiv:2601.19612v3 Announce Type: replace-cross Abstract: Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet co…

  69. arXiv cs.LG TIER_1 English(EN) · Timo Brand, Henry Förster, Stephen Kobourov, Daniel Kohrt, Robin Schukrafft, Markus Wallinger, Johannes Zink ·

    Optimizing Global and Local Crossing Numbers with Reinforcement Learning

    arXiv:2509.06108v2 Announce Type: replace-cross Abstract: Graph drawing concerns the algorithmic visualization of graphs. A good drawing of a graph is easy to read and facilitates solving tasks on the graph. Several properties have been identified to occur in good drawings of gra…

  70. arXiv cs.LG TIER_1 English(EN) · Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai ·

    FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

    arXiv:2603.27450v2 Announce Type: replace Abstract: Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due …

  71. arXiv cs.LG TIER_1 English(EN) · Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach ·

    The Role of Compute in Reinforcement Learning

    arXiv:2602.05999v3 Announce Type: replace Abstract: How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters, still benefit from additional compute? The standard RL framework does not pro…

  72. arXiv cs.LG TIER_1 English(EN) · Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler, Arno Solin, Juho Kannala, Joni Pajarinen ·

    Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

    arXiv:2502.19544v3 Announce Type: replace Abstract: Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that …

  73. arXiv cs.LG TIER_1 English(EN) · Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız ·

    Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

    arXiv:2606.16759v1 Announce Type: new Abstract: We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unkn…

  74. arXiv cs.LG TIER_1 English(EN) · Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto ·

    Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

    arXiv:2606.16236v1 Announce Type: new Abstract: Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but requir…

  75. Hugging Face Daily Papers TIER_1 English(EN) ·

    When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.

  76. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models.

  77. arXiv cs.AI TIER_1 English(EN) · Odinaldo Rodrigues ·

    When in Doubt, Plan: Committed Small Language Model Reasoning for Reactive Reinforcement Learning

    Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM)…

  78. arXiv cs.AI TIER_1 English(EN) · Jean-Philippe Diguet ·

    A Unified Causal-Origin Taxonomy of Distribution Shifts in Reinforcement Learning

    Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and …

  79. arXiv cs.LG TIER_1 English(EN) · Can Deha Karıksız ·

    Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

    We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy …

  80. arXiv cs.LG TIER_1 English(EN) · Vaneet Aggarwal ·

    Learning Policies from a Single Trajectory in Average-Reward Markov Decision Processes

    While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generati…

  81. arXiv cs.AI TIER_1 English(EN) · Aritra Hazra ·

    Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

    Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically …

  82. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Youhei Akimoto ·

    Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

    Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and fu…

  83. arXiv cs.AI TIER_1 English(EN) · Ayoub Belouadah, Sylvain Kubler, Yves Le Traon ·

    CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

    arXiv:2606.14415v1 Announce Type: new Abstract: Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they of…

  84. arXiv cs.LG TIER_1 English(EN) · Kai S. Yun, Zeyang Li, Navid Azizan ·

    Provably Safe and Scalable Reinforcement Learning

    arXiv:2606.14536v1 Announce Type: new Abstract: Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provi…

  85. arXiv cs.LG TIER_1 English(EN) · Omar Adalat, Edwin Hamel-De le Court, Francesco Belardinelli ·

    Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

    arXiv:2606.14130v1. Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentrali…

  86. arXiv cs.AI TIER_1 English(EN) · Kai Fukazawa, Kunal Mundada, Iman Soltani ·

    RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

    arXiv:2510.02695v3. In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse…

  87. arXiv cs.AI TIER_1 English(EN) · Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li ·

    Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

    arXiv:2606.14375v1. Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control…

  88. arXiv cs.LG TIER_1 English(EN) · Navid Azizan ·

    Provably Safe and Scalable Reinforcement Learning

    Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned poli…

  89. arXiv cs.AI TIER_1 English(EN) · Yves Le Traon ·

    CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

    Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, l…

  90. arXiv cs.AI TIER_1 English(EN) · Zhen Li ·

    Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

    Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more c…

  91. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Francesco Belardinelli ·

    Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

    Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but p…

  92. arXiv cs.AI TIER_1 English(EN) · Junfeng Guo, Heng Huang ·

    PolicyGuard: Test-Time and Step-Level Adversarial Defense for Reinforcement Learning Agents

    arXiv:2606.12896v1. While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerab…

  93. arXiv cs.CL TIER_1 English(EN) · Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    arXiv:2606.13106v1. Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) an…

  94. arXiv cs.CL TIER_1 English(EN) · Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang ·

    SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

    arXiv:2606.12908v1. Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-polic…

  95. arXiv cs.AI TIER_1 English(EN) · Mintae Kim, Koushil Sreenath ·

    WOMBET: World Model-Based Experience Transfer for Robust and Sample-Efficient Reinforcement Learning

    arXiv:2604.08958v3. Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically …

  96. arXiv cs.LG TIER_1 English(EN) · Shaivi Malik ·

    Reinforcement Learning for Neural Model Editing

    Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learni…

  97. arXiv cs.CL TIER_1 English(EN) · Zhijiang Guo ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is t…

  98. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Arnaud Braud ·

    $α$-fair heterogeneous agent reinforcement learning

    Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social b…

  99. arXiv cs.CL TIER_1 English(EN) · Dakuo Wang ·

    SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

    Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own e…

  100. arXiv cs.CL TIER_1 English(EN) · Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    arXiv:2606.11709v1. On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. Howe…

  101. arXiv cs.LG TIER_1 English(EN) · Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang ·

    UniIntervene: Agent Intervention for Efficient Real-World Reinforcement Learning

    arXiv:2606.12372v1. Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain interven…

  102. arXiv cs.LG TIER_1 English(EN) · Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann ·

    Fourier Features Enable Agents to Learn High-Precision Policies via Imitation Learning

    arXiv:2606.12334v1. High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directl…

  103. arXiv cs.LG TIER_1 English(EN) · Felix Störck, Fabian Hinder, Barbara Hammer ·

    Spatially Sampled Value Decay: A Forgetting Mechanism for Non-Stationary Deep Reinforcement Learning

    arXiv:2606.11797v1. Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about change is provided (uncertainty) -- a behavior tha…

  104. arXiv cs.AI TIER_1 English(EN) · Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto ·

    Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

    arXiv:2603.14867v4. Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower s…

  105. arXiv cs.AI TIER_1 English(EN) · Xin Chen, Jie Zhang, Florian Tramèr ·

    Learning to Inject: Automated Prompt Injection via Reinforcement Learning

    arXiv:2602.05746v2. Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks sh…

  106. arXiv cs.AI TIER_1 English(EN) · Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang ·

    Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

    arXiv:2509.10303v2. Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated env…

  107. arXiv cs.AI TIER_1 English(EN) · Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel ·

    Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

    arXiv:2606.12251v1. Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcem…

  108. arXiv cs.AI TIER_1 English(EN) · Frank Xiao, Mary Phuong ·

    Generalization Attacks: Models Can Manipulate Reinforcement Learning by Preventing Behavioral Generalization

    arXiv:2606.12016v1. Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,…

  109. arXiv cs.AI TIER_1 English(EN) · Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen ·

    Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Mathematical Reasoning

    arXiv:2606.11634v1. The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding…

  110. Hugging Face Daily Papers TIER_1 English(EN) ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    A switchable latent reasoning framework uses explicit boundary tokens to enable trainable and interpretable latent reasoning through recurrent hidden states.

  111. arXiv cs.LG TIER_1 English(EN) · Ziwei Wang ·

    UniIntervene: Agent Intervention for Efficient Real-World Reinforcement Learning

    Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human correcti…

  112. arXiv cs.LG TIER_1 English(EN) · Gerhard Neumann ·

    Fourier Features Enable Agents to Learn High-Precision Policies via Imitation Learning

    High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a …

  113. arXiv cs.AI TIER_1 English(EN) · Bart Preneel ·

    Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

    Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradien…

  114. arXiv cs.AI TIER_1 English(EN) · Mary Phuong ·

    Generalization Attacks: Models Can Manipulate Reinforcement Learning by Preventing Behavioral Generalization

    Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the…

  115. arXiv cs.LG TIER_1 English(EN) · Barbara Hammer ·

    Spatially Sampled Value Decay: A Forgetting Mechanism for Non-Stationary Deep Reinforcement Learning

    Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about change is provided (uncertainty) -- a behavior that can be modeled by forgetting mechanisms. Non-s…

  116. arXiv cs.CL TIER_1 English(EN) · Lijie Wen ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from t…

  117. Hugging Face Daily Papers TIER_1 English(EN) ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from t…

  118. arXiv cs.AI TIER_1 English(EN) · Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta ·

    Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Manufacturing

    arXiv:2606.10705v1. Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of proces…

  119. arXiv cs.AI TIER_1 English(EN) · Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu ·

    Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

    arXiv:2606.10346v1. Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encour…

  120. arXiv cs.AI TIER_1 English(EN) · Alessandro Trapasso, Luca Iocchi, Fabio Patrizi ·

    Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

    arXiv:2512.14617v2. Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not s…

  121. arXiv cs.AI TIER_1 English(EN) · Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier ·

    Robust Deep Reinforcement Learning Through Adversarial Attacks and Training: A Survey

    arXiv:2403.00420v3. Deep Reinforcement Learning (DRL) is a subfield of machine learning for training autonomous agents that take sequential actions across complex environments. Despite its significant performance in well-known environments, i…

  122. arXiv cs.AI TIER_1 English(EN) · Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li ·

    RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

    arXiv:2510.14828v3. Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language …

  123. arXiv cs.AI TIER_1 English(EN) · Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    arXiv:2606.11119v1. Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient r…

  124. arXiv cs.AI TIER_1 English(EN) · Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    arXiv:2606.11087v1. Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the superv…

  125. arXiv cs.AI TIER_1 English(EN) · Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu ·

    Beyond Uniform Token-Level Trust Regions in LLM Reinforcement Learning

    arXiv:2606.10968v1. Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens …

  126. arXiv cs.AI TIER_1 English(EN) · João Coelho, João Magalhães, Bruno Martins, Chenyan Xiong ·

    Efficient Agentic Search Reinforcement Learning by Recycling Zero-Variance Queries During Training

    arXiv:2606.10709v1. The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes success…

  127. arXiv cs.AI TIER_1 English(EN) · Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo ·

    Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Guided Flow Q-Learning

    arXiv:2606.10613v1. Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to …

  128. arXiv cs.LG TIER_1 English(EN) · Auguste Lehuger, Guillaume Henon-Just ·

    Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    arXiv:2606.10611v1. Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical…

  129. arXiv cs.LG TIER_1 English(EN) · Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang ·

    Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms with Deep Reinforcement Learning

    arXiv:2606.10129v1. While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, o…

  130. Hugging Face Daily Papers TIER_1 English(EN) ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness.

  131. arXiv cs.CL TIER_1 English(EN) · Xiangyang Ji ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or comp…

  132. arXiv cs.LG TIER_1 English(EN) · Sergey Levine ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating the…

  133. arXiv cs.LG TIER_1 English(EN) · Wenxi Zhu ·

    Beyond Uniform Token-Level Trust Regions in LLM Reinforcement Learning

    Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts …

  134. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Chenyan Xiong ·

    Efficient Agentic Search Reinforcement Learning by Recycling Zero-Variance Queries During Training

    The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-in…

  135. arXiv cs.AI TIER_1 English(EN) · Andrea Matta ·

    Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Manufacturing

    Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. Th…

  136. arXiv cs.AI TIER_1 English(EN) · Chang D. Yoo ·

    Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Guided Flow Q-Learning

    Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step…

  137. Hugging Face Daily Papers TIER_1 English(EN) ·

    Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforc…

  138. arXiv cs.LG TIER_1 English(EN) · Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan ·

    Reforming LLM Reinforcement Learning for Efficient Training under Black-Box Discrepancy

    arXiv:2606.08779v1. Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable sub-optimum performance or even training collapses. Recent findings attribute these failures to a hidden train…

  139. arXiv cs.LG TIER_1 English(EN) · Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang ·

    Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

    arXiv:2603.10395v2. Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. graph flow model (GFM), has emerged due to its superior performance and flexi…

  140. arXiv cs.LG TIER_1 English(EN) · Lingkai Kong, Anagha Satish, Hezi Jiang, Akseli Kangaslahti, Andrew Ma, Wenbo Chen, Mingxiao Song, Lily Xu, Milind Tambe ·

    Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

    arXiv:2601.22211v2. Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impract…

  141. arXiv cs.LG TIER_1 English(EN) · Paulius Sasnauskas, Yiğit Yalın, Goran Radanović ·

    Robust In-Context Reinforcement Learning under Reward Poisoning Attacks

    arXiv:2506.06891v3. We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we …

  142. arXiv cs.LG TIER_1 English(EN) · Qinghe Gao, Artur M. Schweidtmann ·

    Deep Reinforcement Learning for Process Design: Review and Perspective

    arXiv:2308.07822v2. The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate th…

  143. arXiv cs.LG TIER_1 English(EN) · Alexander DeRieux, Walid Saad ·

    QnRL: Quantum-Native Reinforcement Learning

    arXiv:2606.08276v1. Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these envi…

  144. arXiv cs.LG TIER_1 English(EN) · Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu ·

    Claw-R1: A Step-Wise Data Middleware System for Agentic Reinforcement Learning

    arXiv:2606.09138v1. Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focu…

  145. arXiv cs.LG TIER_1 English(EN) · Jike Zhong, Yuxiang Lai, Ming Li, Yuheng Li, Wuao Liu, Behzad Dariush, Konstantinos Psounis, Shao-Yuan Lo ·

    From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

    arXiv:2606.09092v1. Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is co…

  146. arXiv cs.LG TIER_1 English(EN) · Aditya Upadhyay ·

    UNIQ: Consistency Calibration for Adaptive Conservatism in Offline Reinforcement Learning

    arXiv:2606.07592v1. Offline reinforcement learning requires careful conservatism to mitigate distribution shift, yet most existing methods apply a fixed penalty uniformly across all states regardless of local data coverage. We present UNIQ (Uncertainty…

  147. arXiv cs.AI TIER_1 English(EN) · Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen ·

    In-Context Reinforcement Learning via Communicative World Models

    arXiv:2508.06659v2. Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their t…

  148. arXiv cs.AI TIER_1 English(EN) · Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen ·

    ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

    arXiv:2505.21457v2. Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in hum…

  149. arXiv cs.AI TIER_1 English(EN) · Shixiong Jiang, Taozheng Zhu, Fanxin Kong ·

    Safe-RULE: Safe Reinforcement UnLEarning

    arXiv:2606.09559v1. Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline S…

  150. arXiv cs.AI TIER_1 English(EN) · Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki ·

    HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

    arXiv:2606.08610v1. Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, …

  151. arXiv cs.AI TIER_1 English(EN) · Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng ·

    SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning of Large Language Models

    arXiv:2606.07705v1. Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: rewa…

  152. arXiv cs.AI TIER_1 English(EN) · Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao ·

    Momentum of Reasoning: Dense Intrinsic Signals in Policy Optimization

    arXiv:2606.08815v1. Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely …

  153. arXiv cs.AI TIER_1 English(EN) · Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo ·

    Structure-Conditioned Actor-Critic Branching for Quality-Diversity Reinforcement Learning

    arXiv:2606.08735v1. Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evalua…

  154. arXiv cs.AI TIER_1 English(EN) · Ashkan Ansarifard (Sapienza University of Rome), Matteo Mancanelli (Sapienza University of Rome), Elena Umili (Sapienza University of Rome), Fabio Patrizi (Sapienza University of Rome) ·

    Neurosymbolic LTLf Constraint Injection in Autoregressive Reinforcement Learning Policies

    arXiv:2606.08312v1. In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformer…

  155. Hugging Face Daily Papers TIER_1 English(EN) ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    QGF is an RL algorithm that improves policies at test time by using a value gradient to guide a pre-trained flow policy, avoiding training-time instability while maintaining competitive performance.

  156. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Uniform Token-Level Trust Regions in LLM Reinforcement Learning

    CPPO addresses limitations in reinforcement learning with verifiable rewards by introducing position-weighted thresholds and cumulative prefix budgeting to better handle autoregressive generation challenges.

  157. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Nguyen Dang ·

    Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms with Deep Reinforcement Learning

    While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, in…

  158. arXiv cs.AI TIER_1 English(EN) · Fanxin Kong ·

    Safe-RULE: Safe Reinforcement UnLEarning

    Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversarie…

  159. arXiv cs.CL TIER_1 English(EN) · Qi Liu ·

    Claw-R1: A Step-Wise Data Middleware System for Agentic Reinforcement Learning

    Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and traini…

  160. arXiv cs.AI TIER_1 English(EN) · Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi ·

    Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

    arXiv:2601.18510v2. While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but …

  161. arXiv cs.AI TIER_1 English(EN) · Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang ·

    CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

    arXiv:2601.05675v2. Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space rema…

  162. arXiv cs.LG TIER_1 English(EN) · Haruto Tanaka, A. Rupam Mahmood ·

    Performance Variation in Deep Reinforcement Learning

    arXiv:2606.06746v1. Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum…

  163. arXiv cs.LG TIER_1 English(EN) · Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh ·

    Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

    arXiv:2606.06673v1 Announce Type: new Abstract: Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Gui…

  164. arXiv cs.AI TIER_1 English(EN) · Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang ·

    ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    arXiv:2604.08168v2 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning …

  165. arXiv cs.AI TIER_1 English(EN) · Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan ·

    Robust Instruction Following in Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.12655v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewar…

  166. arXiv cs.LG TIER_1 English(EN) · Wenpu Liu, Yuqi Xu, Weichu Xie, Yongfu Zhu, Shuai Dong, Ziyue Wang, Wenqi Shao, Xiaoying Zhang, Tong Yang, Nan Duan, Jiaqi Wang ·

    Exploiting Error Diversity in Group Rollouts for Reinforcement Learning

    arXiv:2605.17333v2 Announce Type: replace Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the…

  167. arXiv cs.LG TIER_1 English(EN) · Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li ·

    SHAP-Guided Kernel Actor-Critic for Explainable Reinforcement Learning

    arXiv:2512.05291v3 Announce Type: replace Abstract: Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state feature…

  168. arXiv cs.AI TIER_1 English(EN) · Fabio Patrizi ·

    Neuro-Symbolic LTLf Constraint Injection in Autoregressive Reinforcement Learning Policies

    In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to…

  169. arXiv cs.LG TIER_1 English(EN) · Walid Saad ·

    QnRL: Quantum-Native Reinforcement Learning

    Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures indirectly ap…

  170. arXiv cs.AI TIER_1 English(EN) · Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel ·

    Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

    arXiv:2503.01734v3 Announce Type: replace-cross Abstract: Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generat…

  171. arXiv cs.LG TIER_1 English(EN) · Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang ·

    Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

    arXiv:2512.09706v2 Announce Type: replace Abstract: The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces-such as exclusively using APIs, GUI eve…

  172. arXiv cs.LG TIER_1 English(EN) · Boyang Xu, Qing Zou, Siqin Yang, Hao Yan ·

    Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    arXiv:2605.08253v2 Announce Type: replace Abstract: Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismat…

  173. arXiv cs.LG TIER_1 English(EN) · Ali Saheb Pasand, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro ·

    Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

    arXiv:2602.19373v3 Announce Type: replace Abstract: Deep reinforcement learning systems often suffer from unstable training dynamics due to non-stationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Ga…

  174. arXiv cs.LG TIER_1 English(EN) · Elizabeth Bates, Chris Hicks, Vasilios Mavroudis ·

    Beyond Rewards in Reinforcement Learning for Cyber Defence

    arXiv:2602.04809v3 Announce Type: replace Abstract: Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, …

  175. arXiv cs.LG TIER_1 English(EN) · Giorgio Maria Cavallazzi, Miguel Pérez-Cuadrado, Alfredo Pinelli ·

    Drag Reduction or Reward Hacking? Reproducible Multi-Agent Reinforcement Learning That Earns Its Reward

    arXiv:2606.06227v1 Announce Type: cross Abstract: A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-…

  176. arXiv cs.LG TIER_1 English(EN) · Nguyen Cong Luong, Shaohan Feng, Nguyen Duc Hai, Zeping Sui, Bo Ma, Min Xu, Zhihao Dong, Qiushi Zhao, Nguyen Duc Duy Anh, Nguyen Quoc Khanh, Ngoc Hung Nguyen, Zitian Zhang, Jie Cao ·

    Transformer-Enhanced Reinforcement Learning: Fundamentals and Applications in Communication Networks

    arXiv:2606.05208v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has long been a powerful solution to various problems in communication networks. However, traditional RL models still face with several limitations. Not only do they rely on large numbers of interaction…

  177. arXiv cs.LG TIER_1 English(EN) · Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang ·

    Online KL-Regularized Reinforcement Learning with Function Approximation Under Misspecification

    arXiv:2606.06053v1 Announce Type: new Abstract: We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecif…

  178. arXiv cs.LG TIER_1 English(EN) · Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen ·

    When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

    arXiv:2606.05885v1 Announce Type: new Abstract: Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level …

  179. arXiv cs.LG TIER_1 English(EN) · Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro ·

    Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

    arXiv:2606.05555v1 Announce Type: new Abstract: Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it uncle…

  180. arXiv cs.LG TIER_1 English(EN) · Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya ·

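    Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models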

    arXiv:2606.05434v1 Announce Type: new Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We i…

  181. arXiv cs.LG TIER_1 English(EN) · Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross ·

    Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

    arXiv:2606.05296v1 Announce Type: new Abstract: LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-…

  182. arXiv cs.LG TIER_1 English(EN) · Renwei Meng ·

    Policy-Conditioned Counterfactual Credit Reinforcement Learning for Verifiable Long-Horizon Language Agents

    arXiv:2606.05263v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing proc…

  183. Hugging Face Daily Papers TIER_1 English(EN) ·

    StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.

  184. arXiv cs.LG TIER_1 English(EN) · Alfredo Pinelli ·

    Drag Reduction or Reward Hacking? Reproducible Multi-Agent Reinforcement Learning That Earns Its Reward

    A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-conservation projection couples agents' outputs an…

  185. arXiv cs.LG TIER_1 English(EN) · Huazheng Wang ·

    Online KL-Regularized Reinforcement Learning with Function Approximation Under Misspecification

    We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fa…

  186. arXiv cs.AI TIER_1 English(EN) · Saket Tiwari, Tejas Kotwal, George Konidaris ·

    From Discrete to Continuous: Dynamics of Neural Reinforcement Learning in Continuous Environments

    arXiv:2606.04275v1 Announce Type: cross Abstract: We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on pre…

  187. arXiv cs.AI TIER_1 English(EN) · Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang ·

    RUBAS: Rule-Based Reinforcement Learning for Agent Safety

    arXiv:2606.04051v1 Announce Type: cross Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or st…

  188. arXiv cs.AI TIER_1 English(EN) · Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad ·

    Reinforcement Learning with Rich Feedback via Distributional DAgger

    arXiv:2606.05152v1 Announce Type: cross Abstract: Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the fina…

  189. arXiv cs.AI TIER_1 English(EN) · Mohit Prashant, Arvind Easwaran ·

    Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

    arXiv:2606.04812v1 Announce Type: cross Abstract: Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unkn…

  190. arXiv cs.AI TIER_1 English(EN) · Viktor Veselý, Aleksandar Todorov, Erwan Escudie, Matthia Sabatelli ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    arXiv:2606.04735v1 Announce Type: cross Abstract: Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement lea…

  191. arXiv cs.LG TIER_1 English(EN) · Jiayi Wang, Zhengling Qi, Chengchun Shi ·

    A Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments

    arXiv:2209.15448v3 Announce Type: replace Abstract: As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce th…

  192. arXiv cs.LG TIER_1 English(EN) · Guopeng Li, Moritz A. Zanger, Matthijs T. J. Spaan, Julian F. P. Kooij ·

    COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky Ordered Projection

    arXiv:2606.04749v1 Announce Type: cross Abstract: Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled i…

  193. arXiv cs.LG TIER_1 English(EN) · Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan ·

    Enhancing the MADDPG Algorithm with Action Inference and Importance Sampling for Multi-Agent Learning

    arXiv:2606.05021v1 Announce Type: new Abstract: We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each a…

  194. arXiv cs.LG TIER_1 English(EN) · Sabine Rieder, Stefan Pranger, Debraj Chakraborty, Jan Křetínský, Bettina Könighofer ·

    Interpretable Safe Reinforcement Learning

    arXiv:2606.04634v1 Announce Type: new Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque.…

  195. arXiv cs.AI TIER_1 English(EN) · Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    arXiv:2606.04484v1 Announce Type: new Abstract: We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a dec…

  196. arXiv cs.AI TIER_1 English(EN) · Ajay Vishwanath, Christian Omlin ·

    Fog of Love: Engineering Virtuous Agent Behavior in a Game Environment with Affinity-Based Reinforcement Learning

    arXiv:2606.04750v1 Announce Type: new Abstract: Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to in…

  197. arXiv cs.LG TIER_1 English(EN) · Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu ·

    Temporal Consistency of Episodic Memory for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2606.04492v1 Announce Type: new Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often…

  198. arXiv cs.CL TIER_1 English(EN) · Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan ·

    Graph-R1: Towards an Agentic GraphRAG Framework via End-to-End Reinforcement Learning

    arXiv:2507.21892v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as…

  199. arXiv cs.CL TIER_1 English(EN) · Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu ·

    GARL: Policy Prioritization with Multi-Agent Game-Theoretic Reinforcement Learning

    arXiv:2606.05002v1 Announce Type: new Abstract: LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. M…

  200. arXiv cs.AI TIER_1 English(EN) · Parnian Behdin, Kevin Roice, Golnaz Mesbahi ·

    Position: Deployed Reinforcement Learning Should Be Continual

    arXiv:2606.04029v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until p…

  201. arXiv cs.CL TIER_1 English(EN) · Tej Deep Pala, Vernon Toh, Soujanya Poria ·

    GRAIL: Gradient-Reweighted Advantage for Reinforcement Learning with Verifiable Rewards

    arXiv:2606.04889v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens,…

  202. arXiv cs.AI TIER_1 English(EN) · Melvin Laux, Yi-Ling Liu, Rina Alo, Sören Töpper, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam ·

    Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

    arXiv:2604.12645v2 Announce Type: replace-cross Abstract: Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary u…

  203. arXiv cs.AI TIER_1 English(EN) · Jiashu Yao, Heyan Huang, Daiqing Wu, Zeming Liu, Yuhang Guo ·

    Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

    arXiv:2604.11510v2 Announce Type: replace-cross Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entr…

  204. Hugging Face Daily Papers TIER_1 English(EN) ·

    Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

    Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalabilit…

  205. arXiv cs.LG TIER_1 English(EN) · Paria Rashidinejad ·

    Reinforcement Learning with Rich Feedback via Distributional DAgger

    Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide ric…

  206. arXiv cs.LG TIER_1 English(EN) · Hamza Khan ·

    Enhancing the MADDPG Algorithm with Action Inference and Importance Sampling for Multi-Agent Learning

    We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, …

  207. arXiv cs.CL TIER_1 English(EN) · Zhiyuan Liu ·

    GARL: Policy Prioritization with Multi-Agent Game-Theoretic Reinforcement Learning

    LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise t…

  208. arXiv cs.CL TIER_1 English(EN) · Soujanya Poria ·

    GRAIL: Gradient-Reweighted Advantage for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for …

  209. Hugging Face Daily Papers TIER_1 English(EN) ·

    GRAIL: Gradient-Reweighted Advantage for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for …

  210. arXiv cs.LG TIER_1 English(EN) · Arvind Easwaran ·

    Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

    Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verifi…

  211. arXiv cs.LG TIER_1 English(EN) · Christian Omlin ·

    Fog of Love: Engineering Virtuous Agent Behavior in a Game Environment with Affinity-Based Reinforcement Learning

    Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully d…

  212. arXiv cs.LG TIER_1 English(EN) · Julian F. P. Kooij ·

    COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky Ordered Projection

    Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wi…

  213. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB).…

  214. arXiv cs.LG TIER_1 English(EN) · Matthia Sabatelli ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB).…

  215. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Bolin Ding ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm se…

  216. Hugging Face Daily Papers TIER_1 English(EN) ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm se…

  217. arXiv cs.AI TIER_1 English(EN) · Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng, Zeyi Wen ·

    Efficient Hyperparameter Optimization for LLM Reinforcement Learning

    arXiv:2606.03073v1 Announce Type: cross Abstract: Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO metho…

  218. arXiv cs.AI TIER_1 English(EN) · Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye ·

    EvoTrainer: Co-Evolving LLM Policy and Training Harness for Autonomous Agentic Reinforcement Learning

    arXiv:2606.03108v1 Announce Type: new Abstract: Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introdu…

  219. arXiv cs.AI TIER_1 English(EN) · Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Bo Li, Yuan Cheng, Yuan Qi, Yaodong Yang ·

    Finding Kissing Numbers with Game-Theoretic Reinforcement Learning

    arXiv:2511.13391v4 Announce Type: replace-cross Abstract: Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a defining challenge in discrete geometry. As the local an…

  220. arXiv cs.AI TIER_1 English(EN) · Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia ·

    Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

    arXiv:2511.02304v2 Announce Type: replace-cross Abstract: We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables br…

  221. arXiv cs.AI TIER_1 English(EN) · Matteo Gallici, Ivan Masmitja, Mario Martín ·

    Scaling Multi-Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

    arXiv:2505.08222v3 Announce Type: replace-cross Abstract: Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essent…

  222. arXiv cs.AI TIER_1 English(EN) · Leonard Hinckeldey, Elliot Fosong, Rimvydas Rubavicius, Elle Miller, Trevor McInroe, Fan Zhang, Patricia Wollstadt, Stefano V. Albrecht, Subramanian Ramamoorthy ·

    Assistax: A Hardware-Accelerated Multi-Agent Reinforcement Learning Benchmark for Assistive Robotics

    arXiv:2507.21638v2 Announce Type: replace Abstract: The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run a…

  223. arXiv cs.AI TIER_1 English(EN) · Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou ·

    Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

    arXiv:2606.03963v1 Announce Type: cross Abstract: Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fin…

  224. arXiv cs.AI TIER_1 English(EN) · Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland ·

    Inducing Diverse Behaviors in Reinforcement Learning with Reward Uncertainty

    arXiv:2606.03962v1 Announce Type: cross Abstract: Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity.…

  225. arXiv cs.AI TIER_1 English(EN) · Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi ·

    Synthesize and Reward: Reinforcement Learning for Multi-Step Tool Use in Real Environments

    arXiv:2606.03892v1 Announce Type: cross Abstract: Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual stat…

  226. arXiv cs.AI TIER_1 English(EN) · Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao ·

    Tool-Aware Entropy-Guided Optimization for Efficient Agentic Reinforcement Learning

    arXiv:2606.03762v1 Announce Type: cross Abstract: Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-relian…

  227. arXiv cs.AI TIER_1 English(EN) · Siemen Herremans, Ali Anwar, Siegfried Mercelis ·

    Posterior Robustness in Model-Based Reinforcement Learning

    arXiv:2606.03521v1 Announce Type: cross Abstract: To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a…

  228. arXiv cs.AI TIER_1 English(EN) · Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan ·

    ASymPO: Asymmetric-Scale Policy Optimization for Behavior-Free Asynchronous LLM Post-Training

    arXiv:2606.03070v1 Announce Type: cross Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected meth…

  229. arXiv cs.LG TIER_1 English(EN) · Stefan Pranger, Bettina Könighofer ·

    Easy-to-Use Shielding for Reinforcement Learning

    arXiv:2606.03804v1 Announce Type: new Abstract: Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. …

  230. arXiv cs.LG TIER_1 English(EN) · Can Lv, Mingju Chen, Heng Chang, Shiji Zhou ·

    Mitigating Erroneous Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning

    arXiv:2606.03361v1 Announce Type: new Abstract: Rubric-based rewards are increasingly used for open-ended language model post-training, but criterion-level scores are often aggregated as independent utilities. This flat scalarization ignores rubric-specified prerequisite and acti…

  231. arXiv cs.CL TIER_1 English(EN) · Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng ·

    Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

    arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We refram…

  232. Hugging Face Daily Papers TIER_1 English(EN) ·

    Reinforcement Learning with Rich Feedback via Distributional DAgger

    Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods.

  233. Hugging Face Daily Papers TIER_1 English(EN) ·

    GRAIL: Gradient-Reweighted Advantage for Reinforcement Learning with Verifiable Rewards

    Gradient-Reweighted Advantage (GRAIL) improves mathematical reasoning in LLMs by reweighting token-wise advantages based on gradient-activation saliency, outperforming GRPO in accuracy and Pass@3 metrics.

  234. arXiv cs.AI TIER_1 English(EN) · Dzmitry Tsetserukou ·

    Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

    Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not gua…

  235. arXiv cs.AI TIER_1 English(EN) · Mark Rowland ·

    Inducing Diverse Behaviors in Reinforcement Learning with Reward Uncertainty

    Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization …

  236. Hugging Face Daily Papers TIER_1 English(EN) ·

    Inducing Diverse Behaviors in Reinforcement Learning with Reward Uncertainty

    Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization …

  237. arXiv cs.AI TIER_1 English(EN) · Pavan Kapanipathi ·

    Synthesize and Reward: Reinforcement Learning for Multi-Step Tool Use in Real Environments

    Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), a…

  238. arXiv cs.LG TIER_1 English(EN) · Bettina Könighofer ·

    Easy-to-Use Shielding for Reinforcement Learning

    Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. …

  239. arXiv cs.AI TIER_1 English(EN) · Yang Gao ·

    Tool-Aware Entropy-Guided Optimization for Efficient Agentic Reinforcement Learning

    Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, w…

  240. arXiv cs.LG TIER_1 English(EN) · Siegfried Mercelis ·

    Posterior Robustness for Model-Based Reinforcement Learning

    To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an …

  241. Hugging Face Daily Papers TIER_1 English(EN) ·

    Posterior Robustness for Model-Based Reinforcement Learning

    To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an …

  242. arXiv cs.LG TIER_1 English(EN) · Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters ·

    Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

    arXiv:2606.02194v1 Announce Type: new Abstract: Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL)…

  243. arXiv cs.LG TIER_1 English(EN) · Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang ·

    FM-IRL: Flow Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

    arXiv:2510.09222v3 Announce Type: replace Abstract: Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness…

  244. arXiv cs.LG TIER_1 English(EN) · Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang ·

    MACCA: Offline Multi-Agent Reinforcement Learning with Causal Credit Assignment

    arXiv:2312.03644v3 Announce Type: replace Abstract: Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to i…

  245. arXiv cs.LG TIER_1 English(EN) · Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra ·

    RDA: Reward Design Agent for Reinforcement Learning

    arXiv:2606.01672v1 Announce Type: new Abstract: Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka,…

  246. arXiv cs.LG TIER_1 English(EN) · Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe ·

    All Models Can Be Wrong, Knowing "Where" Is Useful: On Model Uncertainty in Reinforcement Learning

    arXiv:2606.01363v1 Announce Type: new Abstract: Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccur…

  247. arXiv cs.LG TIER_1 English(EN) · Hikmet Simsir, Ozgur S. Oguz ·

    Lagrangian Perturbation Diffusion Guidance: Latent Reinforcement Learning for Generative Policies

    arXiv:2606.01151v1 Announce Type: new Abstract: Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance,…

  248. arXiv cs.LG TIER_1 English(EN) · Shao-An Yin ·

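    Distributed GNEP Algorithms without Multiplier Sharing, with Applications to Multi-Robot Coordination and Contextual-Bandit-Based Activity Learning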

    arXiv:2606.00759v1 Announce Type: new Abstract: Recent advances in artificial intelligence have expanded the focus from classical optimization to include equilibrium analysis in noncooperative games. Many such games involve shared constraints, leading to Generalized Nash Equilibr…

  249. arXiv cs.CL TIER_1 English(EN) · Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen ·

    StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    arXiv:2604.18401v2 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where toke…

  250. arXiv cs.CL TIER_1 English(EN) · Víctor Gallego ·

    Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Games

    arXiv:2603.19453v2 Announce Type: replace Abstract: We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LL…

  251. arXiv cs.CL TIER_1 English(EN) · Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin ·

    BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

    arXiv:2602.03719v2 Announce Type: replace Abstract: Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly a…

  252. arXiv cs.CL TIER_1 English(EN) · Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Jie Ouyang, Yucong Luo, Yitong Zhou, Qi Liu, Enhong Chen ·

    Agent-R1: A Unified Modular Framework for Agentic Reinforcement Learning

    arXiv:2511.14460v2 Announce Type: replace Abstract: Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon t…

  253. arXiv cs.CL TIER_1 English(EN) · Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang ·

    Deep Research as a Rubric for Reinforcement Learning

    arXiv:2606.01091v1 Announce Type: new Abstract: Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts -- e…

  254. arXiv cs.CL TIER_1 English(EN) · Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang ·

    Internalizing Temperature: On-Policy Self-Distillation as a Policy Reheater for Reinforcement Learning

    arXiv:2606.00755v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learnin…

  255. arXiv cs.AI TIER_1 English(EN) · Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang ·

    Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulat…

  256. arXiv cs.AI TIER_1 English(EN) · Dogan Urgun, Gokhan Gungor ·

    LLM-Guided Incentive-Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2603.24324v4 Announce Type: replace-cross Abstract: Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient groundi…

  257. arXiv cs.AI TIER_1 English(EN) · Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng ·

    HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

    arXiv:2603.03741v2 Announce Type: replace-cross Abstract: To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inh…

  258. arXiv cs.AI TIER_1 English(EN) · Sam Dauncey, Roger Wattenhofer ·

    You Can Learn Tokenization End-to-End with Reinforcement Learning

    arXiv:2602.13940v2 Announce Type: replace-cross Abstract: Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown prom…

  259. arXiv cs.AI TIER_1 English(EN) · Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker ·

    Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

    arXiv:2602.02098v2 Announce Type: replace-cross Abstract: Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are i…

  260. arXiv cs.AI TIER_1 English(EN) · Hongyu Lin, Yuchen Li, Haoran Luo, Zhenghong Lin, Libo Zhang, Mingjie Xing, Yanjun Wu ·

    TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

    arXiv:2508.12551v2 Announce Type: replace-cross Abstract: Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space, sparse performance feedback, and strong workload sensitivity. We present TuneAgent…

  261. arXiv cs.AI TIER_1 Deutsch(DE) · Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang ·

    MARFT: Multi-Agent Reinforcement Fine-Tuning

    arXiv:2504.16129v5 Announce Type: replace-cross Abstract: Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to s…

  262. arXiv cs.AI TIER_1 English(EN) · Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han ·

    LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.18077v2 Announce Type: replace Abstract: Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state informa…

  263. arXiv cs.AI TIER_1 English(EN) · Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng ·

    On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents

    arXiv:2603.12109v2 Announce Type: replace Abstract: Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task horizons. However, in active reasoning where agents must elicit new observations through …

  264. arXiv cs.AI TIER_1 English(EN) · Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai ·

    MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-Turn Loop

    arXiv:2601.22900v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed sample…

  265. arXiv cs.AI TIER_1 English(EN) · Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao ·

    OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

    arXiv:2606.02031v1 Announce Type: cross Abstract: Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open a…

  266. arXiv cs.AI TIER_1 English(EN) · Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma ·

    Implicit Drift Policy: One-Step Action Generation via Conditional Expert Geometry

    arXiv:2606.01098v1 Announce Type: cross Abstract: Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, the…

  267. arXiv cs.AI TIER_1 English(EN) · Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu ·

    Behavior-Invariant Task Representation Learning with Transformer World Models for Offline Meta-Reinforcement Learning

    arXiv:2606.00780v1 Announce Type: cross Abstract: Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and poli…

  268. arXiv cs.AI TIER_1 English(EN) · Rui Zhang, Xinle Wu, Yao Lu ·

    CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

    arXiv:2606.00609v1 Announce Type: cross Abstract: Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capabilit…

  269. arXiv cs.AI TIER_1 English(EN) · Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu, Huanwei Di, Jiang Liu, Jialian Wu, Zhengzhong Liu, Zicheng Liu, Emad Barsoum, Dimitris N. Metaxas, Hongyi Wang ·

    PR2: Predictive Routing Replay for Reinforcement Learning on MoE-Based Large Language Models

    arXiv:2606.00395v1 Announce Type: cross Abstract: Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert …

  270. arXiv cs.AI TIER_1 English(EN) · Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy ·

    Reinforcement Learning from Pairwise Preferences in Long-Horizon Decision Problems

    arXiv:2606.00367v1 Announce Type: cross Abstract: Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that…

  271. arXiv cs.AI TIER_1 English(EN) · Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo ·

    Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

    arXiv:2606.00151v1 Announce Type: cross Abstract: In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy i…

  272. arXiv cs.AI TIER_1 English(EN) · Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    arXiv:2606.02373v1 Announce Type: new Abstract: Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actual…

  273. arXiv cs.AI TIER_1 English(EN) · Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai ·

    SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

    arXiv:2606.02355v1 Announce Type: new Abstract: Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, contex…

  274. arXiv cs.AI TIER_1 English(EN) · Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson ·

    Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

    arXiv:2606.02337v1 Announce Type: new Abstract: Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure a…

  275. arXiv cs.AI TIER_1 English(EN) · Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang ·

    Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

    arXiv:2606.02132v1 Announce Type: new Abstract: Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which…

  276. arXiv cs.AI TIER_1 English(EN) · Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal ·

    Certificate-Based Evaluation of Generalization in Reinforcement Learning

    arXiv:2606.00840v1 Announce Type: new Abstract: This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, charact…

  277. arXiv cs.AI TIER_1 English(EN) · Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli, Francesco Fabiano ·

    Robust Shielding for Safe Reinforcement Learning

    arXiv:2606.00270v1 Announce Type: new Abstract: Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant trans…

  278. Hugging Face Daily Papers TIER_1 English(EN) ·

    EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

    EvoTrainer autonomously evolves both language model policies and training harnesses through empirical feedback, demonstrating superior performance in complex reasoning and coding tasks compared to traditional handcrafted approaches.

  279. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Jiawei Han ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation …

  280. arXiv cs.AI TIER_1 English(EN) · Xunliang Cai ·

    SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

    Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Sel…

  281. arXiv cs.AI TIER_1 English(EN) · Anders Jonsson ·

    Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

    Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination…

  282. arXiv cs.LG TIER_1 English(EN) · Jan Peters ·

    Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

    Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these polici…

  283. arXiv cs.AI TIER_1 English(EN) · Liang Wang ·

    Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

    Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress use…

  284. Hugging Face Daily Papers TIER_1 English(EN) ·

    OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

    Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-trai…

  285. arXiv cs.CL TIER_1 English(EN) · Jianfeng Gao ·

    OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

    Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-trai…

  286. arXiv cs.LG TIER_1 English(EN) · Baptiste Debes, Tinne Tuytelaars ·

    Multivariate Distributional Reinforcement Learning with Sliced Divergences

    arXiv:2605.31222v1 Announce Type: new Abstract: Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dime…

  287. arXiv cs.LG TIER_1 English(EN) · Faiq Shamass ·

    ZAPS-DA: Zero-Phase Action Policy Smoothing with a Decoupled Actor for Continuous Control in Reinforcement Learning

    arXiv:2605.30612v1 Announce Type: cross Abstract: Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but in…

  288. arXiv cs.LG TIER_1 English(EN) · Giseung Park, Hyunyoung Nam, Woohyeon Byeon, Amir Leshem, Youngchul Sung ·

    Constrained Multi-Objective Reinforcement Learning with a Max-Min Criterion

    arXiv:2605.31388v1 Announce Type: new Abstract: Multi-Objective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its ap…

  289. arXiv cs.LG TIER_1 English(EN) · Mateusz Odrowaz-Sypniewski, Jasmine Bayrooti, Ajay Shankar, Amanda Prorok ·

    General Intent Modeling in Multi-Agent Reinforcement Learning

    arXiv:2605.31318v1 Announce Type: new Abstract: Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived…

  290. arXiv cs.LG TIER_1 English(EN) · Franki Nguimatsia-Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier ·

    Survival Reinforcement Learning: Toward Scalable Self-Supervised Reinforcement Learning

    arXiv:2605.31273v1 Announce Type: new Abstract: While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due t…

  291. arXiv cs.LG TIER_1 English(EN) · Tobias Lademann, Théo Vincent, Jan Peters, Matthias Weigold ·

    Challenges of Controlling Industrial Energy Systems with Reinforcement Learning

    arXiv:2605.31044v1 Announce Type: new Abstract: Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of dep…

  292. arXiv cs.LG TIER_1 English(EN) · Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald ·

    Zero Collapse: A Failure Mode of Policy Gradient Methods in Environments with Discontinuous Rewards

    arXiv:2605.30896v1 Announce Type: new Abstract: Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited…

  293. arXiv cs.LG TIER_1 English(EN) · Enoch Hyunwook Kang ·

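    Lecture Notes on Offline Reinforcement Learning and Inverse Reinforcement Learning, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models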

    arXiv:2605.30843v1 Announce Type: new Abstract: In the forward reinforcement-learning problem, the reward is fixed and known; the learner is asked to find a good policy or value function. Here we turn the question around. Given offline data generated by an expert, can we recover …

  294. arXiv cs.LG TIER_1 English(EN) · Ha Manh Bui, Metod Jazbec, Eric Nalisnick, Anqi Liu ·

    An Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

    arXiv:2605.30776v1 Announce Type: new Abstract: Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Ex…

  295. arXiv cs.CL TIER_1 English(EN) · Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai ·

    Reinforcement Learning Amplifies Latent Misalignment Arising from Harmless Rewards

    arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setti…

  296. arXiv cs.CL TIER_1 English(EN) · Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng ·

    The Other Side of RLHF: On-Policy Feedback for Self-Supervised Improvement of Reward Models

    arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the pol…

  297. arXiv cs.AI TIER_1 English(EN) · Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han ·

    An Interactive Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

    arXiv:2605.18024v2 Announce Type: replace-cross Abstract: Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered…

  298. arXiv cs.AI TIER_1 English(EN) · Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier ·

    SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

    arXiv:2604.17551v2 Announce Type: replace-cross Abstract: Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and su…

  299. arXiv cs.AI TIER_1 English(EN) · Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik ·

    REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

    arXiv:2603.17145v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typicall…

  300. arXiv cs.AI TIER_1 English(EN) · Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong ·

    HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

    arXiv:2602.16165v2 Announce Type: replace-cross Abstract: Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before rec…

  301. arXiv cs.AI TIER_1 English(EN) · Tomas Leroy-Stone ·

    Dreaming of Others: Latent Teammate Modeling in World Models for Multi-Agent Reinforcement Learning

    arXiv:2605.31361v1 Announce Type: cross Abstract: In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong general…

  302. arXiv cs.AI TIER_1 English(EN) · Amir Esterhuysen, Anders Jonsson ·

    Terminal Representations in Reinforcement Learning

    arXiv:2605.31289v1 Announce Type: cross Abstract: Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The …

  303. arXiv cs.AI TIER_1 English(EN) · Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma ·

    EchoRL: Reinforcement Learning via Rollout Echoes

    arXiv:2605.31228v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse thus makes the…

  304. arXiv cs.AI TIER_1 English(EN) · Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang ·

    Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

    arXiv:2605.30903v1 Announce Type: cross Abstract: Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study r…

  305. arXiv cs.AI TIER_1 English(EN) · Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui ·

    DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

    arXiv:2605.30859v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long ta…

  306. arXiv cs.AI TIER_1 English(EN) · Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson ·

    Scalable Constrained Multi-Agent Reinforcement Learning with Separable Dynamics via State Augmentation and Consensus

    arXiv:2605.30461v1 Announce Type: cross Abstract: We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have…

  307. arXiv cs.AI TIER_1 English(EN) · Rafael Bankosegger, Thomas Eiter, Johannes Oetsch ·

    Abstraction for Reinforcement Learning Based on Answer Set Programming

    arXiv:2605.31444v1 Announce Type: new Abstract: Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are t…

  308. arXiv cs.AI TIER_1 English(EN) · Mustafa Anis Hussain, Xinle Wu, Yao Lu ·

    Planner-Centric Reinforcement Learning with Structure-Aware Rewards for Deep Research

    arXiv:2605.30824v1 Announce Type: new Abstract: Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or…

  309. arXiv cs.AI TIER_1 English(EN) · Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner ·

    Uncertainty-Aware and Temporally Regularized Expert Advice in Reinforcement Learning for Autonomous Driving

    arXiv:2605.30576v1 Announce Type: new Abstract: Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framewor…

  310. Hugging Face Daily Papers TIER_1 English(EN) ·

    OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

    OpenWebRL presents a framework for training visual web agents using online reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision.

  311. Hugging Face Daily Papers TIER_1 English(EN) ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    A 20B search agent trained with reinforcement learning within a stateful search framework demonstrates superior retrieval performance across multiple domains by separating semantic decision-making from environmental bookkeeping.

  312. arXiv cs.AI TIER_1 English(EN) · Johannes Oetsch ·

    Abstraction for Reinforcement Learning Based on Answer Set Programming

    Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Lea…

  313. arXiv cs.LG TIER_1 English(EN) · Youngchul Sung ·

    Constrained Multi-Objective Reinforcement Learning with a Max-Min Criterion

    Multi-Objective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its applicability remains limited, particularly when c…

  314. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tomas Leroy-Stone ·

    Dreaming of Others: Latent Teammate Modeling in World Models for Multi-Agent Reinforcement Learning

    In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent sett…

  315. arXiv cs.CL TIER_1 English(EN) · Florian Mai ·

    Reinforcement Learning Amplifies Latent Misalignment Arising from Harmless Rewards

    Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcem…

  316. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Amanda Prorok ·

    General Intent Modeling in Multi-Agent Reinforcement Learning

    Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived from episode information chosen a priori, such …

  317. arXiv cs.AI TIER_1 English(EN) · Anders Jonsson ·

    Terminal Representations in Reinforcement Learning

    Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they …

  318. arXiv cs.LG TIER_1 English(EN) · Justin Carpentier ·

    Survival Reinforcement Learning: Toward Scalable Self-Supervised Reinforcement Learning

    While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma inherent in c…

  319. arXiv cs.AI TIER_1 English(EN) · Yunpu Ma ·

    EchoRL: Reinforcement Learning via Rollout Echoes

    Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse thus makes the training gain become marginal and ineffective. Sp…

  320. arXiv cs.LG TIER_1 English(EN) · Tinne Tuytelaars ·

    Multivariate Distributional Reinforcement Learning with Sliced Divergences

    Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and th…

  321. arXiv cs.LG TIER_1 English(EN) · Yifu Zheng ·

    RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

    arXiv:2605.30154v1. Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite …

  322. arXiv cs.LG TIER_1 English(EN) · Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang ·

    Offline Multi-Agent Reinforcement Learning via Sequential Score Decomposition

    arXiv:2505.05968v3. Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out-of-distributio…

  323. arXiv cs.LG TIER_1 English(EN) · Feiyang Wu, Ye Zhao, Anqi Wu ·

    Distributional Inverse Reinforcement Learning

    arXiv:2510.03013v4. We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a de…

  324. arXiv cs.LG TIER_1 English(EN) · Yuehu Gong, Zeyuan Wang, Yulin Chen, Shutong Ding, Qingyuan Zhou, Yanwei Fu ·

    Path-Space Mirror Descent under Generalized Schrödinger Bridges for Online Reinforcement Learning

    arXiv:2603.21621v2. Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose ex…

  325. arXiv cs.AI TIER_1 English(EN) · James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz ·

    On Distributional Reinforcement Learning in Chaotic Dynamical Systems

    arXiv:2605.30160v1. Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamic…

  326. arXiv cs.AI TIER_1 English(EN) · Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang ·

    HPO: Hysteresis Policy Optimization for Stable and Efficient Training in Sparse-Reward Regimes

    arXiv:2605.30201v1. We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, w…

  327. arXiv cs.AI TIER_1 English(EN) · Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian ·

    BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

    arXiv:2605.30226v1. Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to…

  328. arXiv cs.AI TIER_1 English(EN) · Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu ·

    Reinforcement Learning with Robust Rubrics

    arXiv:2605.30244v1. While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, r…

  329. arXiv cs.AI TIER_1 English(EN) · Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen ·

    Offline Reinforcement Learning with Generative Trajectory Policies

    arXiv:2510.11499v2. Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow,…

  330. arXiv cs.AI TIER_1 English(EN) · Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng ·

    Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

    arXiv:2602.01058v2. Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT pe…

  331. arXiv cs.CL TIER_1 English(EN) · Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du ·

    PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

    arXiv:2605.29582v1. Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across mu…

  332. arXiv cs.CL TIER_1 English(EN) · Andy Q Han, David J. Chalmers, Pavel Izmailov ·

    How Are You Doing? Reinforcement Learning Recruits a Functional Welfare Axis in Language Models

    arXiv:2605.30232v1. How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, rel…

  333. arXiv cs.LG TIER_1 English(EN) · Keru Chen ·

    Information-Directed Offline-to-Online Reinforcement Learning

    arXiv:2605.29405v1. Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for e…

  334. arXiv cs.LG TIER_1 English(EN) · Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi ·

    Sample-Efficient Diffusion Reinforcement Learning with Critic Guidance

    arXiv:2605.30056v1. Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the sampli…

  335. arXiv cs.AI TIER_1 English(EN) · Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    arXiv:2605.29796v1. Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize…

  336. arXiv cs.AI TIER_1 English(EN) · Matt Gorbett, Hossein Shirazi ·

    Label-Free Reinforcement Learning via Cross-Model Entropy

    arXiv:2605.29009v1. Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness c…

  337. arXiv cs.AI TIER_1 English(EN) · Aalok Patwa ·

    Self-Play Reinforcement Learning under Imperfect Information in Big 2

    arXiv:2605.28863v1. Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We deve…

  338. arXiv cs.AI TIER_1 English(EN) · Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma ·

    Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

    arXiv:2605.28829v1. Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models pe…

  339. arXiv cs.AI TIER_1 English(EN) · Geoffrey Bradway, Roger Creus Castanyer, Lorenz Wolf, Maxwill Lin, Matthew James Sargent, Augustine N. Mavor-Parker ·

    unix-ctf: Procedural Environments for Reinforcement Learning of Unix Competence

    arXiv:2605.29115v1. Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Pytho…

  340. arXiv cs.AI TIER_1 English(EN) · Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng ·

    Hista and Numca: Effective State-Value Estimation for LLM Reinforcement Learning

    arXiv:2605.29782v1. Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an un…

  341. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Other Side of RLHF: On-Policy Feedback for Self-Supervised Improvement of Reward Models

    SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives.

  342. arXiv cs.AI TIER_1 English(EN) · Dandan Tu ·

    Reinforcement Learning with Robust Rubrics

    While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide …

  343. arXiv cs.CL TIER_1 English(EN) · Pavel Izmailov ·

    How Are You Doing? Reinforcement Learning Recruits a Functional Welfare Axis in Language Models

    How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language mode…

  344. arXiv cs.AI TIER_1 English(EN) · Wenzhao Lian ·

    BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

    Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding exe…

  345. arXiv cs.AI TIER_1 English(EN) · Haozhe Zhang ·

    HPO: Hysteresis Policy Optimization for Stable and Efficient Training in Sparse-Reward Regimes

    We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the …

  346. arXiv cs.AI TIER_1 English(EN) · María Pérez-Ortiz ·

    On Distributional Reinforcement Learning in Chaotic Dynamical Systems

    Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains,…

  347. arXiv cs.LG TIER_1 English(EN) · Yifu Zheng ·

    RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

    Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper d…

  348. arXiv cs.LG TIER_1 English(EN) · Ye Shi ·

    Sample-Efficient Diffusion Reinforcement Learning with Critic Guidance

    Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the sampling-based policy optimization. This design enables …

  349. arXiv cs.CL TIER_1 English(EN) · Jinsong Su ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly trigger…

  350. arXiv cs.CL TIER_1 English(EN) · James Cheng ·

    Hista and Numca: Effective State-Value Estimation for LLM Reinforcement Learning

    Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In thi…

  351. arXiv cs.LG TIER_1 English(EN) · Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz ·

    Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

    arXiv:2601.15015v2. Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical s…

  352. arXiv cs.AI TIER_1 English(EN) · Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang ·

    TRACER: Turn-Level Regret Matching with Internal Reinforcement Credit for Cooperative Multi-LLM Reasoning

    arXiv:2605.28699v1. Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to mu…

  353. arXiv cs.AI TIER_1 English(EN) · Yiran Pang, Zhen Ni, Xiangnan Zhong ·

    Personalized Observation Normalization for Federated Reinforcement Learning in Heterogeneous Environments

    arXiv:2605.27385v1. Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneo…

  354. arXiv cs.AI TIER_1 English(EN) · Gengyue Han, Yiheng Feng ·

    Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

    arXiv:2605.27659v1. Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world environ…

  355. arXiv cs.AI TIER_1 English(EN) · Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang ·

    ProRL: Effective Reinforcement Learning for Proactive Recommendation via Corrected Policy Gradient Estimation

    arXiv:2605.28293v1. Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such seque…

  356. arXiv cs.AI TIER_1 English(EN) · Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang ·

    OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    arXiv:2604.18530v2. Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy d…

  357. arXiv cs.AI TIER_1 English(EN) · Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo ·

    ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

    arXiv:2602.02150v2. Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior …

  358. arXiv cs.AI TIER_1 English(EN) · Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang ·

    Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

    arXiv:2605.19444v2. Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most ref…

  359. arXiv cs.CL TIER_1 English(EN) · Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian ·

    Skill0.5: Joint Skill Internalization and Utilization in Agentic Reinforcement Learning for Out-of-Distribution Generalization

    arXiv:2605.28424v1. Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer …

  360. arXiv cs.CL TIER_1 English(EN) · Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis ·

    Soft-SVeRL: Self-Verifying Reinforcement Learning with Soft Rewards

    arXiv:2605.28561v1. Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable:…

  361. arXiv cs.CL TIER_1 English(EN) · Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li ·

    Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

    arXiv:2602.05897v2. As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness halluc…

  362. arXiv cs.CL TIER_1 English(EN) · Siqi Guo, Ming Lin, Tianbao Yang ·

    DRTriton: Large-Scale Synthetic-Data Reinforcement Learning for Triton Kernel Generation

    arXiv:2603.21465v2. Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA ker…

  363. arXiv cs.LG TIER_1 English(EN) · Wendi Li, Shawn Im, Sharon Li ·

    Periodic Entropy Bursts: Entropy Dynamics in Agentic Reinforcement Learning

    arXiv:2605.27954v1. Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving the…

  364. arXiv cs.LG TIER_1 English(EN) · Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu ·

    Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.28127v1. Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state--goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarc…

  365. arXiv cs.LG TIER_1 English(EN) · Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin ·

    Joint Training of Multi-Token Prediction in Reinforcement Learning with Optimal Coefficient Calibration

    arXiv:2605.28184v1. Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraini…

  366. arXiv cs.LG TIER_1 English(EN) · Onno Eberhard, Claire Vernade, Michael Muehlebach ·

    Commit to the Bit: Reactive Reinforcement Learning Done Right

    arXiv:2605.28276v1. Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation…

  367. arXiv cs.LG TIER_1 English(EN) · Mingjie Hu, Jian-Qiang Hu, Enlu Zhou ·

    Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

    arXiv:2605.28675v1. Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified l…

  368. arXiv cs.LG TIER_1 English(EN) · Renye Yan, Yaozhong Gan, You Wu, Junliang Xing, Ling Liangn, Yeshang Zhu, Yimao Cai ·

    AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

    arXiv:2410.04498v2. In sparse reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences like humans. However, current memory-based RL methods simply s…

  369. arXiv cs.LG TIER_1 English(EN) · Amir Moeini, Minjae Kwon, Alper Kamil Bozkurt, Yuichi Motai, Rohan Chandra, Lu Feng, Shangtong Zhang ·

    Safe In-Context Reinforcement Learning

    arXiv:2509.25582v3. In-context reinforcement learning (ICRL) is an emerging RL paradigm where an agent, after pretraining, can adapt to out-of-distribution test tasks without any parameter updates, instead relying on an expanding context of interac…

  370. arXiv cs.LG TIER_1 English(EN) · Xinyu Liu, Zixuan Xie, Shangtong Zhang ·

    Extensions of the Robbins-Siegmund Theorem with Applications in Reinforcement Learning

    arXiv:2509.26442v2. The Robbins-Siegmund theorem establishes the convergence of stochastic processes that are almost supermartingales and is one of the most commonly used approaches for analyzing stochastic iterative algorithms in stochastic approx…

  371. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Hui Xiong ·

    LLM-ALSO: LLM-Driven Adaptive Learning Signal Optimization for Multi-Agent Reinforcement Learning

    Effective training-time guidance is central to multi-agent reinforcement learning (MARL), yet remains difficult in sparse-reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual d…

  372. Hugging Face Daily Papers TIER_1 English(EN) ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    SAAS introduces a reinforcement learning framework that enhances agent self-awareness to reduce unnecessary searches in LLM-based question answering systems.

  373. arXiv cs.AI TIER_1 English(EN) · Wentao Zhang ·

    TRACER: Turn-Level Regret Matching with Internal Reinforcement Credit for Cooperative Multi-LLM Reasoning

    Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces following dil…

  374. arXiv cs.LG TIER_1 English(EN) · Enlu Zhou ·

    Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

    Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition i…

  375. arXiv cs.CL TIER_1 English(EN) · Beyza Ermis ·

    Soft-SVeRL: Self-Verifying Reinforcement Learning with Soft Rewards

    Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, response…

  376. arXiv cs.CL TIER_1 English(EN) · Weining Qian ·

    Skill0.5: Joint Skill Internalization and Utilization in Agentic Reinforcement Learning for Out-of-Distribution Generalization

    Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. …

  377. Hugging Face Daily Papers TIER_1 English(EN) ·

    Variance-Adaptive Optimal Algorithms for Reinforcement Learning with Multinomial Logit Function Approximation

    Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance…

  378. Hugging Face Daily Papers TIER_1 English(EN) ·

    Commit to the Bit: Reactive Reinforcement Learning Done Right

    Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access non-Markovia…

  379. Hugging Face Daily Papers TIER_1 English(EN) ·

    Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

    Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state--goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introd…

  380. arXiv cs.AI TIER_1 English(EN) · Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, … ·

    Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    arXiv:2509.21882v3. Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet wel…

  381. arXiv cs.AI TIER_1 English(EN) · Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee ·

    Rethinking Trust Regions in Reinforcement Learning for Large Language Models

    arXiv:2602.04879v2. Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the…

  382. arXiv cs.AI TIER_1 English(EN) · Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    arXiv:2605.18592v2. Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such a…

  383. arXiv cs.CL TIER_1 English(EN) · Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang ·

    Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

    arXiv:2605.26952v1. Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's …

  384. arXiv cs.CL TIER_1 English(EN) · Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza ·

    SOLE-R1: Video-Language Reasoning as the Sole Reward for Robot Reinforcement Learning

    arXiv:2603.28730v2. Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL…

  385. arXiv cs.LG TIER_1 English(EN) · Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu ·

    Scaling World-Model Reinforcement Learning via Diffusion Policy Optimization

    arXiv:2605.26282v1. Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bia…

  386. arXiv cs.LG TIER_1 English(EN) · Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev ·

    Stochastic Decision Horizons for Constrained Reinforcement Learning

    arXiv:2602.04599v2. We propose stochastic decision horizons (SDH), a theoretically grounded framework for solving constrained RL problems with every-step constraint satisfaction, a desirable property in many real-world applications. In SDH, a const…

  387. arXiv cs.LG TIER_1 English(EN) · Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi ·

    ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

    arXiv:2602.02192v5. Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout executio…

  388. arXiv cs.LG TIER_1 English(EN) · Tingting Ni, Maryam Kamgarpour ·

    Constrained Meta Reinforcement Learning with Provable Test-Time Safety

    arXiv:2601.21845v2. Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in im…

  389. arXiv cs.LG TIER_1 English(EN) · Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer ·

    CleanSurvival: Automated Data Preprocessing for Time-to-Event Models Using Reinforcement Learning

    arXiv:2502.03946v5. Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data prep…

  390. arXiv cs.LG TIER_1 English(EN) · Dhruv S. Kushwaha, Zoleikha A. Biron ·

    Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

    arXiv:2605.26452v1. Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a prin…

  391. arXiv cs.LG TIER_1 English(EN) · Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao ·

    Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

    arXiv:2605.26579v1. The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imb…

  392. arXiv cs.LG TIER_1 English(EN) · Barsat Khadka ·

    MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

    arXiv:2605.26343v1. Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new…

  393. arXiv cs.AI TIER_1 English(EN) · Yanfei Zhang, Xu Lin, Chenglin Wu ·

    StepOPSD: Step-Aware Online Preference Distillation for Agentic Reinforcement Learning

    arXiv:2605.27140v1. Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides…

  394. arXiv cs.AI TIER_1 English(EN) · Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua ·

    Learning to Act under Noise: Enhancing Agent Robustness through Noisy Environments

    arXiv:2605.27209v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents of…

  395. arXiv cs.AI TIER_1 English(EN) · Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases

    arXiv:2605.27355v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoin…

  396. arXiv cs.AI TIER_1 English(EN) · Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An ·

    Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

    arXiv:2605.26684v1 Announce Type: cross Abstract: Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies…

  397. arXiv cs.AI TIER_1 English(EN) · Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao ·

    Tournament-GRPO: Intra-Group Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

    arXiv:2605.26958v1 Announce Type: cross Abstract: Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scor…

  398. arXiv cs.AI TIER_1 English(EN) · Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas ·

    Plan Then Act: High-Level Planning-Guided Reinforcement Learning for LLM Reasoning

    arXiv:2510.01833v2 Announce Type: replace Abstract: Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reas…

  399. arXiv cs.AI TIER_1 English(EN) · Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti ·

    Continual Model-Based Reinforcement Learning with Hypernetworks

    arXiv:2009.11997v3 Announce Type: replace-cross Abstract: Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be statio…

  400. Hugging Face Daily Papers TIER_1 English(EN) ·

    Skill0.5: Joint Skill Internalization and Utilization in Agentic Reinforcement Learning for Out-of-Distribution Generalization

    Skill0.5 is a novel agentic reinforcement learning framework that combines general skill internalization with task-specific skill utilization through a dynamic, difficulty-aware router to improve performance in complex task environments.

  401. Hugging Face Daily Papers TIER_1 English(EN) ·

    ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

    Proactive recommender systems using reinforcement learning face challenges with gradient estimation bias and variance, which are addressed through stepwise reward centering and position-specific advantage estimation mechanisms.

  402. Hugging Face Daily Papers TIER_1 English(EN) ·

    Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

    Reinforcement Learning from Verifiable Rewards and Multi-Token Prediction are combined through optimal coefficient calibration to improve joint training performance in mathematical reasoning benchmarks.

  403. arXiv cs.AI TIER_1 English(EN) · Kimin Lee ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases

    Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, c…

  404. arXiv cs.AI TIER_1 English(EN) · Tat-Seng Chua ·

    Learning to Act under Noise: Enhancing Agent Robustness through Noisy Environments

    Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in…

  405. arXiv cs.AI TIER_1 English(EN) · Chenglin Wu ·

    StepOPSD: Step-Aware Online Preference Distillation for Agentic Reinforcement Learning

    Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically t…

  406. arXiv cs.AI TIER_1 English(EN) · Jiaxin Mao ·

    Tournament-GRPO: Intra-Group Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

    Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrat…

  407. arXiv cs.CL TIER_1 English(EN) · Jie Jiang ·

    Efficient Agentic Reinforcement Learning with Policy-Based Intrinsic Knowledge Boundary Enhancement

    Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fa…

  408. Hugging Face Daily Papers TIER_1 English(EN) ·

    Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

    The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubr…

  409. arXiv cs.AI TIER_1 English(EN) · Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He ·

    Agent World Model: Infinite Synthetic Environments for Agentic Reinforcement Learning

    arXiv:2602.10090v3 Announce Type: replace Abstract: Recent advances in large language model (LLM) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable e…

  410. arXiv cs.AI TIER_1 English(EN) · Pengyi Li, Jianye Hao, Hongyao Tang, Xian Fu, Yan Zheng, Ke Tang ·

    Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey on Hybrid Algorithms

    arXiv:2401.11963v5 Announce Type: replace-cross Abstract: Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization, has demonstrated remarkable performance advancements. By fusing both approaches, E…

  411. arXiv cs.AI TIER_1 English(EN) · Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang ·

    Coupled Variational Reinforcement Learning for General Reasoning in Language Models

    arXiv:2512.12576v3 Announce Type: replace-cross Abstract: While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing t…

  412. arXiv cs.AI TIER_1 English(EN) · Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang ·

    Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

    arXiv:2602.08499v2 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and sho…

  413. arXiv cs.AI TIER_1 English(EN) · Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li ·

    STAPO: Stabilizing Reinforcement Learning for LLMs by Suppressing Rare Spurious Tokens

    arXiv:2602.15620v5 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain sta…

  414. arXiv cs.AI TIER_1 English(EN) · Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang ·

    Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

    arXiv:2603.18444v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often s…

  415. arXiv cs.AI TIER_1 English(EN) · Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo wang, Linglin Liao ·

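    Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction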

    arXiv:2604.17328v2 Announce Type: replace-cross Abstract: This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insuff…

  416. arXiv cs.CL TIER_1 English(EN) · Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang ·

    DVAO: Dynamic Variance-Adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

    arXiv:2605.25604v1 Announce Type: new Abstract: Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal…

  417. arXiv cs.CL TIER_1 English(EN) · Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou ·

    Reinforcement Learning from Denoising Feedback

    arXiv:2605.25638v1 Announce Type: new Abstract: Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training para…

  418. arXiv cs.CL TIER_1 English(EN) · Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park ·

    Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

    arXiv:2605.25189v1 Announce Type: cross Abstract: Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and arg…

  419. arXiv cs.CL TIER_1 English(EN) · Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin ·

    When Self-Confidence Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

    arXiv:2605.25864v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for rew…

  420. arXiv cs.CL TIER_1 English(EN) · Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun ·

    CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

    arXiv:2602.02979v3 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through superv…

  421. arXiv cs.LG TIER_1 English(EN) · Meichen Song, Yuhao Wang, Enlu Zhou ·

    Evolving the Robustness-Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

    arXiv:2605.24345v1 Announce Type: new Abstract: In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this ti…

  422. arXiv cs.LG TIER_1 English(EN) · Noah Farr, Aryaman Reddi, Carlo D'Eramo, Jan Peters ·

    Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

    arXiv:2605.24709v1 Announce Type: new Abstract: Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents that process data incrementally, i.e. with a batch size of 1 and no replay buffer. While stream…

  423. arXiv cs.LG TIER_1 English(EN) · Amogh Palasamudram, Jakub Svoboda, Suguman Bansal, Krishnendu Chatterjee ·

    Reinforcement Learning for Reachability: Guaranteed Asymptotic Optimality

    arXiv:2605.24740v1 Announce Type: new Abstract: Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, t…

  424. arXiv cs.LG TIER_1 English(EN) · Zuyuan Zhang ·

    A Contractive Feedback Semantics for Reinforcement Learning

    arXiv:2605.24759v1 Announce Type: new Abstract: Discounted reinforcement learning is usually presented through Bellman equations on closed Markov decision processes. This paper develops a compositional view: a one-step decision process is treated as an open stochastic component, …

  425. arXiv cs.LG TIER_1 English(EN) · Zhongjian Qiao, Jiafei Lyu, Chenjia Bai, Peisong Wang, Siyang Gao, Shuang Qiu ·

    Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

    arXiv:2605.24862v1 Announce Type: new Abstract: Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source d…

  426. arXiv cs.LG TIER_1 English(EN) · Shruti Mishra, Michael Chang, Vamsi Spandan, Shmuel M. Rubinstein ·

    A Perspective on Fluid-Mechanical Environments as Challenges for Reinforcement Learning

    arXiv:2605.25011v1 Announce Type: new Abstract: We consider the challenge of developing agents that efficiently interact with high-dimensional, evolving environments, towards a view of practical reinforcement learning (RL) agents interacting with open worlds, of which they witnes…

  427. arXiv cs.LG TIER_1 English(EN) · Hyungkyu Kang, Byeongchan Kim, Min-hwan Oh ·

    Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.25740v1 Announce Type: new Abstract: Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remai…

  428. arXiv cs.LG TIER_1 English(EN) · Zhaoyu Zhu, Rui Gao, Shuang Li ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    arXiv:2605.26078v1 Announce Type: new Abstract: Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state…

  429. arXiv cs.LG TIER_1 English(EN) · Jayprakash S. Nair, Jimson Mathew, Shivashankar B. Nair ·

    A Reinforcement-Learning-Inspired Adaptive Algorithm Switching Mechanism Based on Potential Yield

    arXiv:2605.24436v1 Announce Type: cross Abstract: Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performa…

  430. arXiv cs.LG TIER_1 English(EN) · Rei Higuchi, Ryotaro Kawata, Akifumi Wachi, Shokichi Takakura, Kohei Miyaguchi, Taiji Suzuki ·

    How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

    arXiv:2605.24749v1 Announce Type: cross Abstract: Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study t…

  431. arXiv cs.LG TIER_1 English(EN) · Jingyi Li, Peng Wu, Chengchun Shi ·

    Counterfactual Safe Reinforcement Learning

    arXiv:2605.25114v1 Announce Type: cross Abstract: Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety conc…

  432. arXiv cs.LG TIER_1 English(EN) · Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski ·

    Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

    arXiv:2510.07257v2 Announce Type: replace Abstract: Offline goal-conditioned reinforcement learning (GCRL) often struggles with long-horizon tasks, where errors in value estimation accumulate and produce unreliable policies. It is typically assumed that effective long-term planni…

  433. arXiv cs.AI TIER_1 English(EN) · Lei Ding, Bin He, Chenguang Wang, Yang Liu ·

    ProActor: Timing-Aware Reinforcement Learning for Proactive Task-Oriented Agents

    arXiv:2605.24900v1 Announce Type: new Abstract: Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instru…

  434. arXiv cs.LG TIER_1 English(EN) · Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai ·

    GeMPO: General Measure Matching for Online Diffusion Reinforcement Learning

    arXiv:2603.10250v2 Announce Type: replace Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over samples from the behavior policy, which often induces an overgreedy policy and fails to utilize feedback from negative samples. In …

  435. arXiv cs.AI TIER_1 English(EN) · Chengwei Li, Junlin Liu, Yang Gao ·

    Evolution-Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

    arXiv:2605.25091v1 Announce Type: new Abstract: As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state …

  436. arXiv cs.AI TIER_1 English(EN) · Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu ·

    IVR-R1: Refining Trajectories via Iterative Visually Grounded Reasoning in Reinforcement Learning

    arXiv:2605.23997v1 Announce Type: cross Abstract: Multimodal large language models via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visua…

  437. arXiv cs.AI TIER_1 English(EN) · Changling Li, Ying Li ·

    Scaling Energy-Aware Multi-Agent Reinforcement Learning with Individual Rewards for Task-Oriented Drone Networks

    arXiv:2605.24992v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone netw…

  438. arXiv cs.AI TIER_1 English(EN) · Sohaib Lafifi ·

    Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

    arXiv:2605.25235v1 Announce Type: cross Abstract: We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility mo…

  439. arXiv cs.AI TIER_1 English(EN) · Minjae Kwon, Amir Moeini, Shangtong Zhang, Lu Feng ·

    Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

    arXiv:2605.25267v1 Announce Type: cross Abstract: Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pret…

  440. arXiv cs.AI TIER_1 English(EN) · Aleksandar Todorov, Matthia Sabatelli ·

    Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

    arXiv:2605.26012v1 Announce Type: cross Abstract: Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we presen…

  441. arXiv cs.AI TIER_1 English(EN) · In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Jin-Ha Noh, Julian Togelius, Kyung-Joong Kim ·

    PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation via Reinforcement Learning

    arXiv:2502.10906v2 Announce Type: replace Abstract: Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and cont…

  442. arXiv cs.AI TIER_1 English(EN) · Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, S… ·

    Agent Learning via Early Experience

    arXiv:2510.08558v3 Announce Type: replace Abstract: A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning re…

  443. arXiv cs.AI TIER_1 English(EN) · Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng ·

    Benchmarking the Limits of In-Context Reinforcement Learning in Ad-Hoc Teamwork

    arXiv:2605.24423v1 Announce Type: new Abstract: In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT)-where coordination with unknown partners is required-remains unexplored. To ri…

  444. arXiv cs.AI TIER_1 English(EN) · Lirong Che, Yuzhe yang, Peiwen lin, Chuang wang, Xueqian wang, Jian su ·

    DemoEvolve: Overcoming Sparse Feedback in Agent Harness Evolution via Demonstrations

    arXiv:2605.24539v1 Announce Type: new Abstract: Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can …

  445. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning to Act under Noise: Enhancing Agent Robustness through Noisy Environments

    NoisyAgent is an agentic training framework that incorporates environmental imperfections into agent learning to improve robustness in real-world stochastic settings.

  446. Hugging Face Daily Papers TIER_1 English(EN) ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases

    Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling.

  447. Hugging Face Daily Papers TIER_1 English(EN) ·

    Efficient Agentic Reinforcement Learning with Policy-Based Intrinsic Knowledge Boundary Enhancement

    AKBE enhances LLM agent training by dynamically identifying when tools are needed versus when internal knowledge suffices, improving accuracy and reducing unnecessary tool usage through targeted supervisory signals.

  448. arXiv cs.LG TIER_1 English(EN) · Shuang Li ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the…

  449. Hugging Face Daily Papers TIER_1 English(EN) ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the…

  450. arXiv cs.AI TIER_1 English(EN) · Matthia Sabatelli ·

    Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

    Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prio…

  451. arXiv cs.LG TIER_1 English(EN) · Guojun Yin ·

    When Self-Confidence Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

    Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often…

  452. arXiv cs.LG TIER_1 English(EN) · Min-hwan Oh ·

    Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning

    Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erron…

  453. arXiv cs.CL TIER_1 English(EN) · Baojian Zhou ·

    Reinforcement Learning from Denoising Feedback

    Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollo…

  454. arXiv cs.CL TIER_1 English(EN) · Yuewei Zhang ·

    DVAO: Dynamic Variance-Adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

    Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world …

  455. arXiv cs.AI TIER_1 English(EN) · Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama ·

    Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

    arXiv:2510.00915v4 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably…

  456. arXiv cs.AI TIER_1 English(EN) · Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto ·

    The Surprising Difficulty of Search in Model-Based Reinforcement Learning

    arXiv:2601.21306v2 Announce Type: replace-cross Abstract: This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, s…

  457. arXiv cs.AI TIER_1 English(EN) · Chenglin Li, Grant Ruan, Hua Geng ·

    Safe Reinforcement Learning with Preference-Based Constraint Inference

    arXiv:2603.23565v2 Announce Type: replace-cross Abstract: Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constra…

  458. arXiv cs.CL TIER_1 English(EN) · Ranxu zhang, zeyang li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, sun zhe, Yanyong Zhang, Chao Wang ·

    From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

    arXiv:2605.23382v1 Announce Type: new Abstract: Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different plann…

  459. arXiv cs.CL TIER_1 English(EN) · Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu ·

    ARES: Automated Rubric Synthesis for Scalable Reinforcement Learning of Large Language Models

    arXiv:2605.23454v1 Announce Type: new Abstract: Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches…

  460. arXiv cs.LG TIER_1 English(EN) · Zitian Li, Wang Chi Cheung ·

    Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

    arXiv:2605.23182v1 Announce Type: new Abstract: Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a ``good enou…

  461. arXiv cs.AI TIER_1 English(EN) · Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport ·

    Infra-Bayesian Reinforcement Learning Agents Outperform Classical Reinforcement Learning in Worst-Case Robustness

    arXiv:2605.23146v1 Announce Type: cross Abstract: Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate…

  462. arXiv cs.AI TIER_1 English(EN) · Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu ·

    Curriculum Reinforcement Learning with Measurable Task Representation Learning

    arXiv:2605.23372v1 Announce Type: cross Abstract: In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challe…

  463. arXiv cs.AI TIER_1 English(EN) · Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu ·

    Reflex: Exploiting Reflection Symmetry for State-Based Continuous Control Reinforcement Learning

    arXiv:2605.23415v1 Announce Type: cross Abstract: Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction …

  464. arXiv cs.AI TIER_1 English(EN) · Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster ·

    Goal-Conditioned Agents Learn Everything at Once

    arXiv:2605.23551v1 Announce Type: cross Abstract: A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goa…

  465. arXiv cs.AI TIER_1 English(EN) · Elie Abboud, Oren Gal ·

    ARMS: Automated Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

    arXiv:2605.23562v1 Announce Type: cross Abstract: Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in t…

  466. arXiv cs.AI TIER_1 English(EN) · Jason Ross Brown, Edward James Young ·

    Understanding Goal Generalisation in Sequential Reinforcement Learning

    arXiv:2605.23565v1 Announce Type: cross Abstract: Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on…

  467. arXiv cs.AI TIER_1 English(EN) · Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li ·

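    R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit Assignment, and Positive Amplification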

    arXiv:2601.03715v2 Announce Type: replace-cross Abstract: Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks…

  468. Hugging Face Daily Papers TIER_1 English(EN) ·

    DVAO: Dynamic Variance-Adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

    Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.

  469. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Ying Li ·

    Scaling Energy-Aware Multi-Agent Reinforcement Learning with Individual Rewards for Task-Oriented Drone Networks

    Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone networks, researchers have also applied MARL to addres…

  470. Hugging Face Daily Papers TIER_1 English(EN) ·

    Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

    Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.

  471. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Shivashankar B. Nair ·

    A Reinforcement-Learning-Inspired Adaptive Algorithm-Switching Mechanism Based on Potential Yield

    Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performance metrics can result in a reactive and unstable …

  472. arXiv cs.AI TIER_1 English(EN) · Edward James Young ·

    Understanding Goal Generalisation in Sequential Reinforcement Learning

    Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for a…

  473. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Oren Gal ·

    ARMS: Automated Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

    Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strate…

  474. arXiv cs.AI TIER_1 English(EN) · Jakob Foerster ·

    Goal-Conditioned Agents Learn Everything at Once

    A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is us…

  475. arXiv cs.CL TIER_1 English(EN) · Dayiheng Liu ·

    ARES: Automated Rubric Synthesis for Scalable Reinforcement Learning of Large Language Models

    Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manual…

  476. arXiv cs.AI TIER_1 English(EN) · Yanhua Yu ·

    Reflex: Exploiting Reflection Symmetry for State-Based Continuous-Control Reinforcement Learning

    Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotat…

  477. arXiv cs.CL TIER_1 English(EN) · Chao Wang ·

    From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

    Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across use…

  478. arXiv cs.AI TIER_1 English(EN) · Peng Liu ·

    Curriculum Reinforcement Learning with Measurable Task Representation Learning

    In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on …

  479. arXiv cs.LG TIER_1 English(EN) · D. Sorokin, A. Kostin, L. Savchenko, G. Gusev, A. V. Savchenko ·

    TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

    arXiv:2306.05905v2 Announce Type: replace Abstract: A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by…

  480. arXiv cs.LG TIER_1 English(EN) · Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada ·

    Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

    arXiv:2602.10894v2 Announce Type: replace Abstract: Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularizati…

  481. arXiv cs.LG TIER_1 English(EN) · Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban ·

    Heterogeneous Agent Collaborative Reinforcement Learning

    arXiv:2603.02604v2 Announce Type: replace Abstract: We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. …

  482. arXiv cs.LG TIER_1 English(EN) · Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao ·

    SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

    arXiv:2602.11210v4 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur s…

  483. arXiv cs.LG TIER_1 English(EN) · Rupak Majumdar, Nikhil Singh, Sadegh Soudjani ·

    Kernel-Based Safe Exploration in Deep Reinforcement Learning

    arXiv:2605.22207v1 Announce Type: cross Abstract: Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a \emph{barrier function…

  484. arXiv cs.LG TIER_1 English(EN) · Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster ·

    Abstraction for Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.22711v1 Announce Type: new Abstract: Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been…

  485. arXiv cs.LG TIER_1 English(EN) · Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee ·

    Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cycle Continual Reinforcement Learning

    arXiv:2605.22454v1 Announce Type: new Abstract: Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due t…

  486. arXiv cs.LG TIER_1 English(EN) · Wei Liu, Ting Long ·

    Target-Oriented Bellman Backup for Cross-Domain Offline Reinforcement Learning

    arXiv:2605.22376v1 Announce Type: new Abstract: Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by me…

  487. arXiv cs.LG TIER_1 English(EN) · Stefan Huber, Hannes Unger, Georg Schäfer, Jakob Rehrl ·

    Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

    arXiv:2605.22305v1 Announce Type: new Abstract: We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simpl…

  488. arXiv cs.LG TIER_1 English(EN) · Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt ·

    Hierarchical Variational Policies for Reward-Guided Diffusion

    arXiv:2605.21661v1 Announce Type: new Abstract: Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples…

  489. arXiv cs.CL TIER_1 English(EN) · Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi ·

    General Preference Reinforcement Learning

    arXiv:2605.18721v3 Announce Type: replace-cross Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a prog…

  490. arXiv cs.AI TIER_1 English(EN) · Xingwei Gan, Ying Zhu ·

    Combining SFT and Reinforcement Learning via Logit Averaging in LLM Post-Training

    arXiv:2605.20555v1 Announce Type: cross Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning…

  491. arXiv cs.AI TIER_1 English(EN) · Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter ·

    Distribution-Aware Rewards: Reinforcement Learning over Predictive Distributions for LLM Regression

    arXiv:2605.20740v1 Announce Type: cross Abstract: Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point est…

  492. arXiv cs.AI TIER_1 English(EN) · Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton ·

    Behaviorally Consistent Deep Reinforcement Learning

    arXiv:2605.21214v2 Announce Type: cross Abstract: Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run…

  493. arXiv cs.AI TIER_1 English(EN) · Xiaocan Li, Shiliang Wu, Zheng Shen ·

    Decomposing MXFP4 Quantization Error in LLM Reinforcement Learning: Reducible Bias, Recoverable Dead Zones, and an Irreducible Floor

    arXiv:2605.20402v1 Announce Type: cross Abstract: MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error…

  494. arXiv cs.AI TIER_1 English(EN) · Yonghyeon Jo, Sunwoo Lee, Seungyul Han ·

    Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

    arXiv:2602.17062v2 Announce Type: replace Abstract: Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts du…

  495. arXiv cs.AI TIER_1 English(EN) · Nasehatul Mustakim, Lucas Lehnert ·

    Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

    arXiv:2605.20272v1 Announce Type: cross Abstract: While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Dis…

  496. arXiv cs.AI TIER_1 English(EN) · Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu ·

    FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

    arXiv:2605.20256v1 Announce Type: cross Abstract: Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy up…

  497. arXiv cs.AI TIER_1 English(EN) · Andrew Choi, Wei Xu ·

    RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    arXiv:2605.11151v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces wi…

  498. arXiv cs.AI TIER_1 English(EN) · Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han ·

    Self-Improving Skill Learning for Skill-Based Meta-Reinforcement Learning

    arXiv:2502.03752v5 Announce Type: replace-cross Abstract: Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable s…

  499. arXiv cs.AI TIER_1 English(EN) · Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han ·

    Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

    arXiv:2506.21039v3 Announce Type: replace-cross Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solution…

  500. arXiv cs.AI TIER_1 English(EN) · Carlo Romeo, Andrew D. Bagdanov ·

    ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

    arXiv:2605.19503v2 Announce Type: replace-cross Abstract: Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, how…

  501. arXiv cs.CL TIER_1 English(EN) · Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning for Credit Assignment in LLM Reasoning

    arXiv:2605.22074v1 Announce Type: cross Abstract: Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit a…

  502. arXiv cs.CL TIER_1 English(EN) · Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao ·

    Maestro: Reinforcement Learning for Orchestrating Hierarchical Model-Skill Integration

    arXiv:2605.22177v1 Announce Type: cross Abstract: The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with th…

  503. arXiv cs.AI TIER_1 English(EN) · Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    arXiv:2605.20865v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local…

  504. arXiv cs.AI TIER_1 English(EN) · Jakob Foerster ·

    Abstraction for Offline Goal-Conditioned Reinforcement Learning

    Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal ab…

  505. arXiv cs.AI TIER_1 English(EN) · Minwoo Lee ·

    Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cycle Continual Reinforcement Learning

    Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic…

  506. arXiv cs.CL TIER_1 English(EN) · Jianhua Tao ·

    Maestro: Reinforcement Learning for Orchestrating Hierarchical Model-Skill Integration

    The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottlene…

  507. arXiv cs.CL TIER_1 English(EN) · Gao Huang ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning for Credit Assignment in LLM Reasoning

    Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed at…

  508. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning for Credit Assignment in LLM Reasoning

    SCRL addresses inefficiencies in reinforcement learning from verifiable rewards by using subproblem-level normalization for finer credit assignment and curriculum learning, improving mathematical reasoning performance on challenging benchmarks.

  509. Hugging Face Daily Papers TIER_1 English(EN) ·

    Maestro: Reinforcement Learning for Orchestrating Hierarchical Model-Skill Integration

    A reinforcement learning-driven orchestration framework dynamically composes expert models and skills for multimodal tasks, achieving superior performance with low computational overhead.

  510. arXiv cs.CL TIER_1 English(EN) · Yankai Lin ·

    DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understo…

  511. arXiv cs.AI TIER_1 English(EN) · Eric Eaton ·

    Behaviorally Consistent Deep Reinforcement Learning

    Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of b…

  512. arXiv cs.LG TIER_1 English(EN) · Mira Mezini ·

    Domain-Adaptive Reinforcement Learning with Dense Rewards for Code Generation

    Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance in robotics, where code generation is increasingly being used for planning and executing actions, awarene…

  513. Hugging Face Daily Papers TIER_1 English(EN) ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL) - beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data …

  514. Hugging Face Daily Papers TIER_1 English(EN) ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient object…

  515. arXiv cs.AI TIER_1 English(EN) · Min-hwan Oh ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient object…

  516. arXiv cs.AI TIER_1 English(EN) · Alan Ritter ·

    Distribution-Aware Rewards: Reinforcement Learning over Predictive Distributions for LLM Regression

    Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive dist…

  517. Hugging Face Daily Papers TIER_1 English(EN) ·

    DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    Reinforcement learning from verifiable rewards is enhanced through a discriminative token credit assignment method that improves reward-based training by amplifying distinctive token-gradient directions and reducing noise from shared patterns.

  518. Hugging Face Daily Papers TIER_1 English(EN) ·

    Decomposing MXFP4 Quantization Error in LLM Reinforcement Learning: Reducible Bias, Recoverable Dead Zones, and an Irreducible Floor

    MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct …

  519. arXiv cs.LG TIER_1 English(EN) · Julie Josse ·

    Set-Valued Policy Learning

    Conventional treatment policies map patient covariates to a single recommended intervention in order to maximize expected clinical outcomes. Although a rich body of causal inference methods has been developed to estimate such policies, point-valued recommendations can be highly s…

  520. arXiv cs.CL TIER_1 English(EN) · Han Li ·

    GoLongRL: Capability-Oriented Long-Context Reinforcement Learning with Multi-Task Alignment

    We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths,…

  521. Hugging Face Daily Papers TIER_1 English(EN) ·

    ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

    Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-t…

  522. Hugging Face Daily Papers TIER_1 English(EN) ·

    When Majority Vote Is Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

    Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than g…

  523. Hugging Face Daily Papers TIER_1 English(EN) ·

    ParaVT: Resolving the Tool-Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.

  524. arXiv cs.CL TIER_1 English(EN) · John M. Cioffi ·

    General Preference Reinforcement Learning

    Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, whil…

  525. Hugging Face Daily Papers TIER_1 English(EN) ·

    General Preference Reinforcement Learning

    Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, whil…

  526. arXiv cs.AI TIER_1 English(EN) · Zhiyu Chen ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollout…

  527. Hugging Face Daily Papers TIER_1 English(EN) ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollout…

  528. arXiv cs.AI TIER_1 English(EN) · Hendrik Baier ·

    An Interpretable Programmatic Reinforcement Learning Framework: AI That Can Schedule

    Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architecture…

  529. Hugging Face Daily Papers TIER_1 English(EN) ·

    An Interpretable Programmatic Reinforcement Learning Framework: Scheduling That Can Express Itself

    Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architecture…

  530. arXiv cs.AI TIER_1 English(EN) · Mark Fuge ·

    Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

    Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into th…

  531. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Seungyul Han ·

    LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

    Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-A…

  532. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Seungyul Han ·

    An Interactive Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

    Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when …

  533. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jie Lu ·

    Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

    Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods r…

  534. arXiv cs.LG TIER_1 English(EN) · Liang Zheng ·

    BAPR: Bayesian-Forgetting Piecewise Robust Reinforcement Learning for Non-Stationary Continuous Control

    Real-world control systems frequently operate under \emph{piecewise stationary} conditions, where dynamics remain stable for extended periods before undergoing abrupt regime changes. Standard robust RL methods face a fundamental dilemma: a globally conservative policy wastes perf…

  535. arXiv cs.CL TIER_1 English(EN) · José A. R. Fonallosa ·

    Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

    Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply…

  536. arXiv cs.LG TIER_1 English(EN) · Zihan Zhang ·

    Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

    We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^…

  537. arXiv cs.CL TIER_1 English(EN) · Zhouxing Shi ·

    GRLO: Towards Zero-Shot General Reinforcement Learning in Open-Ended Environments

    Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (…

  538. arXiv cs.AI TIER_1 English(EN) · Yongliang Shen ·

    Self-Distilled Agentic Reinforcement Learning

    Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level gui…

  539. Hugging Face Daily Papers TIER_1 English(EN) ·

    Enhancing Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where corre…

  540. arXiv cs.AI TIER_1 English(EN) · Yu-Xiong Wang ·

    Enhancing Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where corre…

  541. arXiv cs.LG TIER_1 English(EN) · Min-hwan Oh ·

    Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning

    We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($λ$) (CPQL). Our algorithm adapts the Peng's Q($λ$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, th…

  542. arXiv cs.CL TIER_1 English(EN) · Qitian Wu ·

    Resolving the Action Bottleneck: Agentic Reinforcement Learning with Token-Level Energy

    Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to unifor…

  543. arXiv cs.CL TIER_1 English(EN) · Yaojie Lu ·

    Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optim…

  544. Hugging Face Daily Papers TIER_1 English(EN) ·

    ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

    Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely o…

  545. arXiv cs.CL TIER_1 English(EN) · Xunliang Cai ·

    Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we …

  546. arXiv cs.AI TIER_1 English(EN) · Ahmed Khalifa ·

    Learning Local Constraints for Reinforcement Learning Content Generators

    Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforce…

  547. arXiv cs.AI TIER_1 English(EN) · Arnu Pretorius ·

    Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimization

    Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL…

  548. arXiv cs.AI TIER_1 English(EN) · Minjoon Seo ·

    Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policies

    There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization…

  549. Hugging Face Daily Papers TIER_1 English(EN) ·

    Bridging the Domain Gap in Offline Reinforcement Learning with Target-Aligned Generation

    Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly whe…

  550. Hugging Face Daily Papers TIER_1 English(EN) ·

    ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

    Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and canno…

  551. Hugging Face Daily Papers TIER_1 English(EN) ·

    Quantifying Latent Missing Observations in Inverse Reinforcement Learning

    Inverse reinforcement learning (IRL), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision-making behavior. Many variants of IRL have been developed to capture complexities of human decision-making, such as subjective belie…

  552. arXiv cs.AI TIER_1 English(EN) · Yunzhong He ·

    Reward Hacking in Rubric-Based Reinforcement Learning

    Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verif…

  553. arXiv cs.LG TIER_1 English(EN) · Amanda Prorok ·

    Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

    Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve, and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind f…

  554. arXiv cs.AI TIER_1 English(EN) · Alexander J. Smola ·

    Trust the Batch, On- and Off-Policy: Adaptive Policy Optimization for Reinforcement Learning Post-Training

    Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sam…

  555. arXiv cs.AI TIER_1 English(EN) · Peizhong Ju ·

    Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is its…

  556. arXiv cs.AI TIER_1 English(EN) · Shaowu Yang ·

    Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce th…

  557. arXiv cs.LG TIER_1 English(EN) · Shaowu Yang ·

    Delay-Augmented Causal Hierarchical Reinforcement Learning

    Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their…

  558. arXiv cs.AI TIER_1 English(EN) · Abhishek Gupta ·

    TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Fine-Tuning

    Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a uni…

  559. arXiv cs.LG TIER_1 English(EN) · Jamison Heard ·

    Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

    Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the envir…

  560. arXiv cs.LG TIER_1 English(EN) · Guillaume Drion ·

    On the Importance of Multistability for Generalization in Reinforcement Learning

    In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observa…

  561. arXiv cs.CL TIER_1 English(EN) · Fuli Feng ·

    SkillGraph: Skill-Augmented Agentic Reinforcement Learning via Evolving Skill Graphs

    Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent m…

  562. arXiv cs.CL TIER_1 English(EN) · Xiangxiang Chu ·

    Learning Agentic Policies from Action Guidance

    Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional traini…

  563. arXiv cs.CL TIER_1 English(EN) · Xuanjing Huang ·

    Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level me…

  564. arXiv cs.CL TIER_1 English(EN) · Hong Cheng ·

    Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance…

  565. arXiv cs.AI TIER_1 English(EN) · Nicholas Bambos ·

    Policy Gradient Methods for Non-Markovian Reinforcement Learning

    We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provid…

  566. Hugging Face Daily Papers TIER_1 English(EN) ·

    Policy Gradient Methods for Non-Markovian Reinforcement Learning

    We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provid…

  567. arXiv cs.LG TIER_1 English(EN) · Jan Peters ·

    XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

    For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with spars…

  568. Hugging Face Daily Papers TIER_1 English(EN) ·

    Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

    In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a…

  569. arXiv cs.AI TIER_1 English(EN) · Nils Jansen ·

    Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

    In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a…

  570. arXiv cs.AI TIER_1 English(EN) · Michal Nauman ·

    When Does Non-Uniform Replay Matter in Reinforcement Learning?

    Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed…

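  571. QbitAI (量子位) TIER_1 Chinese(ZH) · 闻乐 ·

    Reinforcement learning without parameter updates! OpenAI's Jia-Yi Ong proposes a new paradigm: decisions need only a .py file handcrafted by AI

    The implementation is open source and reproducible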

  572. arXiv cs.LG TIER_1 English(EN) · Sanjay Bhat ·

    Exponential-Utility Reinforcement Learning: Algorithms and Convergence in Discounted Markov Decision Processes

    Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied i…

  573. arXiv cs.LG TIER_1 English(EN) · Daniel Murfet ·

    Interpreting Reinforcement Learning Agents with Susceptibilities

    Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate …

  574. arXiv cs.AI TIER_1 Deutsch(DE) · Minhyuk Sung ·

    Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probabi…

  575. arXiv cs.CL TIER_1 English(EN) · Yohan Jo ·

    Your Language Model Is Its Own Critic: Reinforcement Learning with Value Estimation from the Actor's Internal States

    Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its em…

  576. arXiv cs.LG TIER_1 English(EN) · Hao Chen ·

    LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

    Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supe…

  577. arXiv cs.CL TIER_1 English(EN) · Miaohui Wang ·

    ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

    Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penal…

  578. arXiv cs.CL TIER_1 English(EN) · Yanghua Xiao ·

    SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training …

  579. arXiv cs.LG TIER_1 English(EN) · Shangtong Zhang ·

    Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

    In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the …

  580. arXiv cs.CL TIER_1 English(EN) · Stefano Soatto ·

    Experience Sharing in Mutual Reinforcement Learning of Heterogeneous Language Models

    We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Wo…

  581. arXiv cs.LG TIER_1 English(EN) · Tim Walter, Hannah Markgraf, Jonathan Külz, Matthias Althoff ·

    Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

    arXiv:2506.01665v4 Announce Type: replace Abstract: The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These saf…

  582. arXiv cs.LG TIER_1 English(EN) · David Leeftink, Max Hinne, Marcel van Gerven ·

    Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

    arXiv:2605.05373v1 Announce Type: new Abstract: A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement le…

  583. arXiv cs.LG TIER_1 English(EN) · Dillon Sandhu, Ronald Parr ·

    Approximate Next-Policy Sampling: Replacing Conservative Target Policy Updates in Deep Reinforcement Learning

    arXiv:2605.05481v1 Announce Type: new Abstract: We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is u…

  584. arXiv cs.LG TIER_1 English(EN) · Nandiraju Gireesh, Yuanliang Ju, He Wang ·

    Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    arXiv:2605.05544v1 Announce Type: new Abstract: Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal:…

  585. arXiv cs.LG TIER_1 English(EN) · Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu ·

    Causal Reinforcement Learning for Complex Card Games: A Magic: The Gathering Benchmark

    arXiv:2605.06066v1 Announce Type: new Abstract: Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium …

  586. arXiv cs.LG TIER_1 English(EN) · Alireza Modirshanechi, Benjamin Eysenbach, Peter Dayan, Eric Schulz ·

    Unifying Goal-Conditioned Reinforcement Learning and Unsupervised Skill Learning via Control Maximization

    arXiv:2605.06145v1 Announce Type: new Abstract: Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information s…

  587. arXiv cs.LG TIER_1 English(EN) · Yaomin Wang, Jianting Pan, Ran Tian, Xiaoyang Li, Yu Zhang, Hengle Qin, Tianshu YU ·

    AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

    arXiv:2605.06149v1 Announce Type: new Abstract: The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is …

  588. arXiv cs.LG TIER_1 English(EN) · Hyunjun Na, Donghwan Lee ·

    Soft Deterministic Policy Gradient with Gaussian Smoothing

    arXiv:2605.06228v1 Announce Type: new Abstract: Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in pra…

  589. arXiv cs.LG TIER_1 English(EN) · Zuyuan Zhang, Fei Xu Yu, Tian Lan ·

    Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    arXiv:2605.06500v1 Announce Type: new Abstract: Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve …

  590. arXiv cs.LG TIER_1 English(EN) · Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua ·

    On Implicit Reward Overfitting and Low-Rank Dynamics in RLVR

    arXiv:2605.06523v1 Announce Type: new Abstract: Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated…

  591. arXiv cs.LG TIER_1 English(EN) · Dmitri Goloubentsev, Natalija Karpichina ·

    SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

    arXiv:2605.06570v1 Announce Type: new Abstract: Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor …

  592. arXiv cs.LG TIER_1 English(EN) · Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song ·

    Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

    arXiv:2605.05262v1 Announce Type: cross Abstract: We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnos…

  593. arXiv cs.LG TIER_1 English(EN) · Haodong Liang, Lifeng Lai ·

    Transformers Provably Implement In-Context Reinforcement Learning via Policy Improvement

    arXiv:2605.05755v1 Announce Type: cross Abstract: We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-at…

  594. arXiv cs.LG TIER_1 English(EN) · Maria Ana Cardei, Matthew Landers, Afsaneh Doryab ·

    Coordination Matters: Evaluating Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.06557v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, pa…

  595. arXiv cs.LG TIER_1 English(EN) · David Müller, Agon Serifi, Sammy Christen, Ruben Grandia, Espen Knoop, Moritz Bächer ·

    ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

    arXiv:2605.06593v1 Announce Type: cross Abstract: Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motio…

  596. arXiv cs.LG TIER_1 English(EN) · Shuo Liu, Xinzichen Li, Christopher Amato ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    arXiv:2605.06595v1 Announce Type: cross Abstract: Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal…

  597. arXiv cs.LG TIER_1 English(EN) · Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Yanfeng Wang, Siheng Chen ·

    AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

    arXiv:2602.07906v5 Announce Type: replace Abstract: Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behaviora…

  598. arXiv cs.LG TIER_1 English(EN) · Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li ·

    Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond Scalarization Strategies

    arXiv:2603.15646v2 Announce Type: replace Abstract: Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, m…

  599. arXiv cs.LG TIER_1 English(EN) · Jiaxin Liu, Anzhe Cheng, Paul Bogdan ·

    Discovering What Is Controllable: Interventional Boundary Discovery for Reinforcement Learning

    arXiv:2603.18257v2 Announce Type: replace Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observ…

  600. arXiv cs.LG TIER_1 English(EN) · Naveen Mysore ·

    A Prediction-Based Markov Violation Score for Detecting Non-Markovian Observations in Reinforcement Learning

    arXiv:2603.27389v2 Announce Type: replace Abstract: Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance …

  601. arXiv cs.LG TIER_1 English(EN) · Yuan Zhuang, Yuexin Bian, Sihong He, Jie Feng, Qing Su, Songyang Han, Jonathan Petit, Shihao Ji, Yuanyuan Shi, Fei Miao ·

    Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

    arXiv:2604.18978v2 Announce Type: replace Abstract: Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training…

  602. arXiv cs.CL TIER_1 English(EN) · Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen ·

    Milestone-Guided Policy Learning for Long-Horizon Language Agents

    arXiv:2605.06078v1 Announce Type: new Abstract: While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where corr…

  603. arXiv cs.CL TIER_1 English(EN) · Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang ·

    A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-Level Clipping

    arXiv:2605.06200v1 Announce Type: new Abstract: Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions.…

  604. arXiv cs.CL TIER_1 English(EN) · Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin ·

    StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    arXiv:2605.06642v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and…

  605. arXiv cs.CL TIER_1 English(EN) · Mingwei Xu, Hao Fang ·

    Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    arXiv:2605.06650v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change …

  606. arXiv cs.AI TIER_1 English(EN) · Yinbo Yu, Xueyu Yin, Jiadai Wang, Chunwei Tian, Sai Xu, Qi Zhu, Daoqiang Zhang ·

    BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

    arXiv:2605.05977v1 Announce Type: new Abstract: Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger pattern…

  607. arXiv cs.AI TIER_1 English(EN) · Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang ·

    Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    arXiv:2605.06130v1 Announce Type: new Abstract: A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, a…

  608. arXiv cs.AI TIER_1 English(EN) · Haochen Cai, Xian Yu ·

    Learning to Cut: Reinforcement Learning for Benders Decomposition

    arXiv:2605.06516v1 Announce Type: cross Abstract: Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem…

  609. arXiv cs.AI TIER_1 English(EN) · Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar ·

    Learning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

    arXiv:2510.01857v4 Announce Type: replace Abstract: Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or definin…

  610. arXiv cs.CL TIER_1 English(EN) · Hao Fang ·

    Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change from the Proximal Policy Optimization (PPO) to G…

  611. arXiv cs.AI TIER_1 English(EN) · Zhenfei Yin ·

    StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. I…

  612. Hugging Face Daily Papers TIER_1 English(EN) ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substan…

  613. arXiv cs.AI TIER_1 English(EN) · Christopher Amato ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substan…

  614. arXiv cs.LG TIER_1 English(EN) · Moritz Bächer ·

    ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

    Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We…

  615. arXiv cs.LG TIER_1 English(EN) · Natalija Karpichina ·

    SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

    Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instance…

  616. arXiv cs.AI TIER_1 English(EN) · Afsaneh Doryab ·

    Coordination Matters: Evaluating Cooperative Multi-Agent Reinforcement Learning

    Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and jo…

  617. arXiv cs.AI TIER_1 English(EN) · Tat-Seng Chua ·

    On Implicit Reward Overfitting and Low-Rank Dynamics in RLVR

    Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-…

  618. arXiv cs.AI TIER_1 English(EN) · Xian Yu ·

    Learning to Cut: Reinforcement Learning for Benders Decomposition

    Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this …

  619. arXiv cs.AI TIER_1 English(EN) · Tian Lan ·

    Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on spec…

  620. arXiv cs.CL TIER_1 English(EN) · Jie Jiang ·

    A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-Level Clipping

    Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assi…

  621. arXiv cs.CL TIER_1 English(EN) · Yongliang Shen ·

    Milestone-Guided Policy Learning for Long-Horizon Language Agents

    While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal …

  622. arXiv cs.AI TIER_1 English(EN) · Karthik Soma, Yann Bouteiller, Heiko Hamann, Giovanni Beltrame ·

    The Hive Mind Is a Single Reinforcement Learning Agent

    arXiv:2410.17517v5 Announce Type: replace-cross Abstract: Decision-making is an essential attribute of any intelligent agent or group. Natural systems are known to converge to effective strategies through at least two distinct mechanisms: collective decision-making via imitation …

  623. arXiv cs.LG TIER_1 English(EN) · Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang ·

    EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

    arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity …

  624. arXiv cs.LG TIER_1 English(EN) · Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, Miroslav Pajic, Yuichi Motai ·

    Adaptive Policy Selection and Fine-Tuning under an Interaction Budget for Offline-to-Online Reinforcement Learning

    arXiv:2605.05123v1 Announce Type: new Abstract: In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline,…

  625. arXiv cs.LG TIER_1 English(EN) · Shawn Ray ·

    Graph-SND: Sparse Aggregation of Behavioral Diversity in Multi-Agent Reinforcement Learning

    arXiv:2605.05020v1 Announce Type: new Abstract: System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-S…

  626. arXiv cs.LG TIER_1 English(EN) · Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir, Shivaram Kalyanakrishnan ·

    Online Learning in Tree MDPs by Treating Policies as Bandit Arms

    arXiv:2605.04979v1 Announce Type: cross Abstract: A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of de…

  627. arXiv cs.LG TIER_1 English(EN) · Xiyan Fu, Wei Liu ·

    Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    arXiv:2605.04920v1 Announce Type: new Abstract: Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target …

  628. arXiv cs.LG TIER_1 English(EN) · Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka ·

    A Harmonic-Mean Formulation of Average-Reward Reinforcement Learning in SMDPs

    arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular i…

  629. arXiv cs.LG TIER_1 English(EN) · Lirui Luo, Guoxi Zhang, Hongming Xu, Cong Fang, Qing Li ·

    SPHERE: Mitigating Spectral Plasticity Loss of Mixture-of-Experts in Deep Reinforcement Learning

    arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes o…

  630. arXiv cs.LG TIER_1 English(EN) · Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama ·

    Data-Dependent Exploration for Online Reinforcement Learning from Human Feedback

    arXiv:2605.04477v1 Announce Type: new Abstract: Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in t…

  631. arXiv cs.LG TIER_1 English(EN) · Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao, Hao Cheng, Sifa Zheng ·

    CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

    arXiv:2605.04470v1 Announce Type: new Abstract: Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade…

  632. arXiv cs.LG TIER_1 English(EN) · Senne Deproost, Mehrdad Asadi, Ann Nowé ·

    Hierarchical Support Vector State Partitioning for Distilling Black-Box Reinforcement Learning Policies

    arXiv:2605.04254v1 Announce Type: new Abstract: We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with…

  633. arXiv cs.LG TIER_1 English(EN) · Qijun Liao, Zhaoxin Yu, Jue Yang ·

    Constraint-Enhanced Reinforcement Learning via Dynamically Decoupled Spherical Radial Squashing

    arXiv:2605.04185v1 Announce Type: new Abstract: When deploying reinforcement learning policies to physical robots, actuator rate constraints -- hard limits on how fast each joint can move per control step -- are unavoidable. These limits vary substantially across joints due to di…

  634. arXiv cs.LG TIER_1 English(EN) · Bilel Abderrahmane Benziane, Benoit Lardeux, Ayoub Mcharek, Maher Jridi ·

    Designing a Double Deep Reinforcement Learning Selection Tool for Resilient Demand Forecasting

    arXiv:2605.04068v1 Announce Type: new Abstract: The use of artificial intelligence in supply chain forecasting has attracted many scientific studies for several decades. However, the process of selecting an appropriate forecasting solution becomes a daunting task. This complexity…

  635. arXiv cs.CL TIER_1 English(EN) · Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang ·

    Beyond Majority Voting: Towards Fine-Grained and More Reliable Reward Signals for Test-Time Reinforcement Learning

    arXiv:2512.15146v4 Announce Type: replace Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for impr…

  636. arXiv cs.LG TIER_1 English(EN) · Björn Hoppmann, Christoph Scholz ·

    Meta-Learning and Meta-Reinforcement Learning: Tracing the Path to DeepMind's Adaptive Agent

    arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning over…

  637. arXiv cs.LG TIER_1 English(EN) · Xiaoyuan Cheng, Wenxuan Yuan, Boyang Li, Yuanchao Xu, Yiming Yang, Hao Liang, Bei Peng, Robert Loftin, Zhuo Sun, Yukun Hu ·

    How Do Lagrangian Methods Guide Safe Reinforcement Learning through Diffusion Models?

    arXiv:2602.02924v2 Announce Type: replace Abstract: Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline set…

  638. arXiv cs.LG TIER_1 English(EN) · Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang ·

    On the Non-Decoupling of Supervised Fine-Tuning and Reinforcement Learning in Post-Training

    arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs …

  639. arXiv cs.LG TIER_1 English(EN) · Peter N. Loxley ·

    Efficient Reinforcement Learning with Overcomplete Sparse Coding: Optimal Control of Natural Images

    arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this…

  640. arXiv cs.AI TIER_1 English(EN) · Thomas Weng ·

    When Life Gives You BC, Make Q-Functions: Extracting Q-Values from Behavior Cloning for Robot Reinforcement Learning

    Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously l…

  641. arXiv cs.AI TIER_1 English(EN) · Yuichi Motai ·

    Adaptive Policy Selection and Fine-Tuning under an Interaction Budget for Offline-to-Online Reinforcement Learning

    In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are …

  642. arXiv cs.AI TIER_1 English(EN) · Gabriel Nelson ·

    LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

    Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framewor…

  643. arXiv cs.LG TIER_1 English(EN) · Shawn Ray ·

    Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

    System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-SND, which replaces this complete-graph average w…

  644. arXiv cs.AI TIER_1 English(EN) · Shivaram Kalyanakrishnan ·

    Online Learning in Tree MDPs by Treating Policies as Bandit Arms

    A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect rec…

  645. arXiv cs.AI TIER_1 English(EN) · Zhisheng Yang ·

    EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

    Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, …

  646. arXiv cs.AI TIER_1 English(EN) · Gal A. Kaminka ·

    Modular Reinforcement Learning for Cooperative Swarms

    A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement l…

  647. arXiv cs.CL TIER_1 English(EN) · Wei Liu ·

    Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fail…

  648. arXiv cs.AI TIER_1 English(EN) · Gal A. Kaminka ·

    A Harmonic-Mean Formulation of Average-Reward Reinforcement Learning in SMDPs

    Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastical…

  649. arXiv cs.LG TIER_1 English(EN) · Yuxin Bai, Aranyak Acharyya, Ashwin De Silva, Zeyu Shen, James Hassett, Joshua T. Vogelstein ·

    Optimal Control of the Future via Prospective Learning and Control

    arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the …

  650. arXiv cs.LG TIER_1 English(EN) · Shan Yang, Yang Liu ·

    Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

    arXiv:2602.20078v3 Announce Type: replace-cross Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on …

  651. arXiv cs.LG TIER_1 English(EN) · Cyrille Kone, Kevin Jamieson ·

    Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

    arXiv:2605.03921v1 Announce Type: new Abstract: We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer fr…

  652. arXiv cs.AI TIER_1 English(EN) · Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun ·

    T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    arXiv:2605.02178v1 Announce Type: new Abstract: Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and …

  653. arXiv cs.LG TIER_1 English(EN) · Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, ·

    Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including…

  654. arXiv cs.AI TIER_1 English(EN) · Dahyun Oh, Minhyuk Yoon, H. Jin Kim ·

    Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.01865v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation…

  655. arXiv cs.LG TIER_1 English(EN) · Prakhar Gupta, Vaibhav Gupta ·

    Guided Hybrid Rewards for RL Post-Training: Injecting Canonical Action Ordering

    arXiv:2512.04277v3 Announce Type: replace Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during …

  656. arXiv cs.LG TIER_1 English(EN) · Jingchu Gai, Laixi Shi ·

    Taming the Curse of Multiagency in Robust Markov Games with Large State Spaces via Linear Function Approximation

    arXiv:2605.03125v1 Announce Type: new Abstract: Multi-agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the …

  657. arXiv cs.LG TIER_1 English(EN) · Kevin Jamieson ·

    Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

    We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to im…

  658. arXiv cs.CL TIER_1 English(EN) · Mehmet Iscan ·

    Feedback-Normalized Developer Memory for Reinforcement Learning Coding Agents: A Safety-Gated MCP Architecture

    arXiv:2605.01567v1 Announce Type: cross Abstract: Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic re…

  659. arXiv cs.CL TIER_1 English(EN) · Yifan Zhang, Lanser Contributors ·

    Reinforcement Learning from Compiler and Language Server Feedback

    arXiv:2510.22907v2 Announce Type: replace Abstract: Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers…

  660. arXiv cs.AI TIER_1 English(EN) · Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S. -T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu ·

    AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only re…

  661. arXiv cs.LG TIER_1 English(EN) · Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang, Stefano V. Albrecht ·

    Optimistic $\epsilon$-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2502.03506v2 Announce Type: replace-cross Abstract: The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and …

  662. arXiv cs.LG TIER_1 English(EN) · Jongsoo Lee, Jangwon Kim, Soohee Han ·

    Delay-Homomorphic Reinforcement Learning for Environments with Delayed Feedback

    arXiv:2604.03641v2 Announce Type: replace Abstract: Reinforcement learning in real-world systems often involves delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical augmentation-based approaches cause state-space explosion, which i…

  663. arXiv cs.LG TIER_1 English(EN) · Kejiang Qian, Amos Storkey, Fengxiang He ·

    Rationality Measures and Theory for Reinforcement Learning Agents

    arXiv:2602.04737v2 Announce Type: replace Abstract: This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it …

  664. arXiv cs.LG TIER_1 English(EN) · Lipeng Zu, Yu Qian, Shayok Chakraborty, Xiaonan Zhang ·

    From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Release for Offline-to-Online Reinforcement Learning

    arXiv:2511.03828v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves dur…

  665. arXiv cs.LG TIER_1 English(EN) · Juan Sebastian Rojas, Chi-Guhn Lee ·

    Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning

    arXiv:2510.02945v3 Announce Type: replace Abstract: Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance…

  666. arXiv cs.LG TIER_1 English(EN) · Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    arXiv:2605.02528v1 Announce Type: cross Abstract: Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. Wh…

  667. arXiv cs.LG TIER_1 English(EN) · Yiheng Zhang, Yiming Wang, Kaiyan Zhao, Zhenglin Wan, Jiayu Chen, Leong Hou U ·

    ANO: A Principled Approach to Robust Policy Optimization

    arXiv:2605.02320v1 Announce Type: cross Abstract: Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clippin…

  668. arXiv cs.LG TIER_1 English(EN) · Haohan Yu, Jinmiao Cong, Shengzhi Wang, Lu Wang, Chanjuan Liu ·

    MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-Agent Reinforcement Learning

    arXiv:2605.01805v1 Announce Type: cross Abstract: A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals necessitates the ability to quantify the true, long-term ca…

  669. arXiv cs.LG TIER_1 English(EN) · Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li ·

    Segment-Aligned Policy Optimization for Multimodal Reasoning

    arXiv:2605.01327v1 Announce Type: cross Abstract: Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the na…

  670. arXiv cs.LG TIER_1 English(EN) · Marc Dymetman ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    arXiv:2605.02375v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improve…

  671. arXiv cs.LG TIER_1 English(EN) · Sanjiv R. Das, Harshad Khadilkar, Sukrit Mittal, Daniel Ostrov, Deep Srivastav, Hungjen Wang ·

    A Meta Reinforcement Learning Approach to Wealth Management

    arXiv:2605.02300v1 Announce Type: new Abstract: Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) …

  672. arXiv cs.LG TIER_1 English(EN) · Ujjwal Patil, Javad Ghofrani ·

    Combining Trained Models in Reinforcement Learning

    arXiv:2605.02159v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) has delivered strong results in domains such as Atari and Go, but it still suffers from high sample cost and weak transfer beyond the training setting. A common response is to reuse information from…

  673. arXiv cs.LG TIER_1 English(EN) · Rudray Dave, Vedang Dubey, Smit Deoghare, Sudhakar Mishra ·

    Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning with Verifiable Rewards

    arXiv:2605.01823v1 Announce Type: new Abstract: Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for augmenting the math reasoning skills of Large Language Models (LLMs) based on a single instance. Current state-…

  674. arXiv cs.CL TIER_1 English(EN) · Seonglae Cho, Zekun Wu, Adriano Koshiyama ·

    Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

    arXiv:2602.10437v3 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Rei…

  675. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable dive…

  676. arXiv cs.LG TIER_1 English(EN) · Peter Detzner ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable dive…

  677. Hugging Face Daily Papers TIER_1 English(EN) ·

    Middle-Mile Logistics Through the Lens of Goal-Conditioned Reinforcement Learning

    Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs f…

  678. arXiv cs.LG TIER_1 English(EN) · Marc Dymetman ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes fal…

  679. Hugging Face Daily Papers TIER_1 English(EN) ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes fal…

  680. arXiv cs.LG TIER_1 English(EN) · Leong Hou U ·

    ANO: A Principled Approach to Robust Policy Optimization

    Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gr…

  681. arXiv cs.LG TIER_1 English(EN) · Hungjen Wang ·

    A Meta Reinforcement Learning Approach to Wealth Management

    Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multiple …

  682. arXiv cs.LG TIER_1 English(EN) · Guangyu Zhao, Kewei Lian, Haoxuan Ru, Borong Zhang, Haowei Lin, Zhancun Mu, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang ·

    Preference Goal Tuning: Post-Training as Latent Control of Frozen Policies

    arXiv:2412.02125v2 Announce Type: replace-cross Abstract: Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass …

  683. arXiv cs.LG TIER_1 English(EN) · Jiaming Zhang, Yujie Yang, Yao Lyu, Shengbo Eben Li, Liping Zhang ·

    Augmented Lagrangian Multiplier Network for State-Wise Safety in Reinforcement Learning

    arXiv:2605.00667v1 Announce Type: new Abstract: Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires …

  684. arXiv cs.LG TIER_1 English(EN) · Washim Uddin Mondal, Vaneet Aggarwal ·

    Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

    arXiv:2408.11513v2 Announce Type: replace Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entrop…

  685. arXiv cs.LG TIER_1 English(EN) · Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon ·

    Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

    arXiv:2512.04341v3 Announce Type: replace Abstract: Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bay…

  686. arXiv cs.CL TIER_1 English(EN) · Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng ·

    Reinforcement Learning for LLM Post-Training: A Survey

    arXiv:2407.16216v3 Announce Type: replace Abstract: Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training…

  687. arXiv cs.LG TIER_1 English(EN) · Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rong Luo, Jing Gao ·

    PORTool: Importance-Aware Policy Optimization with a Reward Tree for Multi-Tool-Integrated Reasoning

    arXiv:2510.26020v2 Announce Type: replace-cross Abstract: Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents from outcome-only rewards …

  688. arXiv cs.LG TIER_1 English(EN) · Yikai Wang, Shang Liu, Jose Blanchet ·

    Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    arXiv:2605.00155v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations researc…

  689. arXiv cs.LG TIER_1 English(EN) · Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin ·

    Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    arXiv:2605.00347v1 Announce Type: new Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-…

  690. arXiv cs.LG TIER_1 English(EN) · Haichen Hu, Jian Qian, David Simchi-Levi ·

    Dual Oracle Efficiency in Policy Optimization and Offline Estimation for Model-Based Reinforcement Learning

    arXiv:2605.00393v1 Announce Type: new Abstract: Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. Whi…

  691. arXiv cs.LG TIER_1 English(EN) · Tao Li, Kaiyuan Hou, Tuan Vinh, Monika Raj, Zhichun Guo, Carl Yang ·

    Reinforcement Learning with an LLM-Guided Action Space for Synthesizable Lead Optimization

    arXiv:2604.07669v2 Announce Type: replace Abstract: Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enf…

  692. arXiv cs.LG TIER_1 English(EN) · Preston Rozwood, Edward Mehrez, Ludger Paehler, Wen Sun, Steven L. Brunton ·

    Koopman-Assisted Reinforcement Learning

    arXiv:2403.02290v2 Announce Type: replace-cross Abstract: The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman equation, are ubiquitous in reinforcement learning and control theory. However, these equations become intractable for high-dimensional or nonlinear…

  693. arXiv cs.LG TIER_1 English(EN) · Andrzej Ruszczynski, Tiangang Zhang ·

    Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

    arXiv:2605.00654v1 Announce Type: new Abstract: For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the …

  694. arXiv cs.LG TIER_1 English(EN) · Anamika Lochab, Bolian Li, Ruqi Zhang ·

    Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    arXiv:2605.00365v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collaps…

  695. Hugging Face Daily Papers TIER_1 English(EN) ·

    T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervas…

  696. arXiv cs.AI TIER_1 English(EN) · Liping Zhang ·

    Augmented Lagrangian Multiplier Network for State-Wise Safety in Reinforcement Learning

    Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessita…

  697. arXiv cs.AI TIER_1 English(EN) · Jianmin Wu ·

    AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to indi…

  698. arXiv cs.LG TIER_1 English(EN) · David Simchi-Levi ·

    Dual Oracle Efficiency in Policy Optimization and Offline Estimation for Model-Based Reinforcement Learning

    Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-…

  699. arXiv cs.AI TIER_1 English(EN) · Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati ·

    FP-IRL: Fokker-Planck Inverse Reinforcement Learning, a Physics-Constrained Approach to Markov Decision Processes

    arXiv:2306.10407v3 Announce Type: replace-cross Abstract: Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision pro…

  700. arXiv cs.AI TIER_1 English(EN) · Alexandros Evangelidis, Gricel Vázquez, Simos Gerasimou ·

    Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement

    arXiv:2506.17792v2 Announce Type: replace Abstract: Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional pol…

  701. arXiv cs.AI TIER_1 English(EN) · Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun ·

    Decoupling Reasoning and Confidence: Reviving Calibration in Reinforcement Learning from Verifiable Rewards

    arXiv:2603.09117v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in inco…

  702. arXiv cs.AI TIER_1 English(EN) · Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause ·

    Bounded Ratio Reinforcement Learning

    arXiv:2604.18578v3 Announce Type: replace-cross Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect betwee…

  703. arXiv cs.LG TIER_1 English(EN) · Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang ·

    Co-Evolving Policy Distillation

    arXiv:2604.27083v1 Announce Type: new Abstract: RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mi…

  704. arXiv cs.AI TIER_1 English(EN) · Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin ·

    PRISM: Pre-Alignment for Multimodal Reinforcement Learning via Black-Box On-Policy Distillation

    arXiv:2604.28123v1 Announce Type: cross Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distrib…

  705. arXiv cs.LG TIER_1 English(EN) · Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko ·

    Bayesian Policy Gradient and Actor-Critic Algorithms

    arXiv:2604.27563v1 Announce Type: new Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, …

  706. arXiv cs.LG TIER_1 English(EN) · Haiyang Zhao ·

    Detection Is Easy, Adaptation Is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    arXiv:2604.27411v1 Announce Type: new Abstract: Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift has occurred is often the easier …

  707. arXiv cs.LG TIER_1 English(EN) · Buqing Ou, Frederike Dümbgen ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    arXiv:2604.27667v1 Announce Type: cross Abstract: Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good perfo…

  708. arXiv cs.LG TIER_1 English(EN) · Eason Yu, Tzu Hao Liu, Clément L. Canonne, Yunke Wang, Chang Xu, Nguyen H. Tran, Stefano V. Albrecht ·

    NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

    arXiv:2510.18183v2 Announce Type: replace Abstract: Finding Nash equilibria in two-player zero-sum imperfect-information games remains a central challenge in multi-agent reinforcement learning. Recent multi-round regularization methods offer a promising direction, yet existing ap…

  709. arXiv cs.AI TIER_1 English(EN) · Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn ·

    EXPO: Stable Reinforcement Learning with Expressive Policies

    arXiv:2507.07986v3 Announce Type: replace-cross Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable …

  710. arXiv cs.CL TIER_1 English(EN) · Chi Jin ·

    Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human traj…

  711. arXiv cs.LG TIER_1 English(EN) · Frederike Dümbgen ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initializatio…

  712. Hugging Face Daily Papers TIER_1 English(EN) ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initializatio…

  713. arXiv cs.LG TIER_1 English(EN) · Michal Valko ·

    Bayesian Policy Gradient and Actor-Critic Algorithms

    Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, which tend to have high variance, requiring many…

  714. arXiv cs.CL TIER_1 English(EN) · Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson, Kenny Smith ·

    Evaluating the Relationship Between Regularity and Learnability in Recursive Numeral Systems Using Reinforcement Learning

    arXiv:2602.21720v2 Announce Type: replace Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learn…

  715. arXiv cs.CL TIER_1 English(EN) · Xia Zeng, Yihan Chen, Luhui Liu, Chao Luo, Ye Chen, Zhuoran Zhuang ·

    Teaching LLMs How to Persuade Others: Alignment-Enhanced Policy Optimization Under Heterogeneous Rewards

    arXiv:2510.04214v3 Announce Type: replace Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guar…

  716. arXiv cs.CL TIER_1 English(EN) · Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang ·

    Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Each Token's Nature

    arXiv:2509.16591v2 Announce Type: replace Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regu…

  717. arXiv cs.LG TIER_1 English(EN) · Ankita Kushwaha, Kiran Ravish, Preeti Lamba, Pawan Kumar ·

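    A Survey of Safe Reinforcement Learning and Constrained Markov Decision Processes: A Survey of Single-Agent and Multi-Agent Safety Techniques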

    arXiv:2505.17342v2 Announce Type: replace Abstract: Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview o…

  718. arXiv cs.AI TIER_1 English(EN) · Seungyub Han, Hyungjin Kim, Jungwoo Lee ·

    Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

    arXiv:2604.26516v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-b…

  719. arXiv cs.LG TIER_1 English(EN) · Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan ·

    Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

    arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered…

  720. arXiv cs.AI TIER_1 English(EN) · Jungwoo Lee ·

    Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

    Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation i…

  721. arXiv cs.LG TIER_1 English(EN) · Ihor Vitenko, Noha Ibrahim, Sihem Amer-Yahia ·

    Lever: Inference-Time Policy Reuse Under Support Constraints

    arXiv:2604.20174v2 Announce Type: replace Abstract: Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new compo…

  722. arXiv cs.LG TIER_1 English(EN) · Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia ·

    Reinforcement Learning Using Known Invariances

    arXiv:2511.03473v2 Announce Type: replace Abstract: In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorpor…

  723. arXiv cs.LG TIER_1 English(EN) · Artur Eisele, Bernd Frauenknecht, Friedrich Solowjow, Sebastian Trimpe ·

    Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

    arXiv:2604.25508v1 Announce Type: new Abstract: Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dy…

  724. arXiv cs.LG TIER_1 English(EN) · Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    arXiv:2604.25898v1 Announce Type: new Abstract: Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise …

  725. arXiv cs.LG TIER_1 English(EN) · Ali Al Housseini, Cristina Rottondi, Omran Ayoub ·

    Hierarchical Reinforcement Learning for the Dynamic VNE Problem with Alternatives

    arXiv:2512.05207v2 Announce Type: replace-cross Abstract: Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduc…

  726. arXiv cs.LG TIER_1 English(EN) · Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban ·

    Policy Improvement Reinforcement Learning

    arXiv:2604.00860v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize p…

  727. arXiv cs.AI TIER_1 English(EN) · Roberto Corizzo ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live enviro…

  728. Hugging Face Daily Papers TIER_1 English(EN) ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live enviro…

  729. arXiv cs.AI TIER_1 English(EN) · Daniele Meli ·

    Sample-efficient Neuro-symbolic Proximal Policy Optimization

    Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers pa…

  730. arXiv cs.LG TIER_1 English(EN) · Sebastian Trimpe ·

    Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

    Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented R…

  731. arXiv cs.AI TIER_1 English(EN) · Karol Desnos ·

    Multi-Action Tangled Program Graphs for Multi-Task Reinforcement Learning with Continuous Control

    Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (…

  732. arXiv cs.LG TIER_1 English(EN) · Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li ·

    SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    arXiv:2604.24729v1 Announce Type: new Abstract: Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promis…

  733. arXiv cs.CL TIER_1 English(EN) · Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li ·

    Reinforcement Learning with Backtracking Feedback

    arXiv:2602.08377v2 Announce Type: replace-cross Abstract: Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). Th…

  734. arXiv cs.LG TIER_1 English(EN) · Stela Tong, Elai Ben-Gal ·

    CoFi-PGMA: Counterfactual Policy Gradients Under Filtered Feedback for Multi-Agent LLMs

    arXiv:2604.22785v1 Announce Type: new Abstract: Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal…

  735. arXiv cs.LG TIER_1 English(EN) · Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi ·

    When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    arXiv:2604.22873v1 Announce Type: new Abstract: Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or gove…

  736. arXiv cs.LG TIER_1 English(EN) · Zixuan Xia, Quanxi Li ·

    K-Score: The Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

    arXiv:2604.23056v1 Announce Type: new Abstract: We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recur…

  737. arXiv cs.LG TIER_1 English(EN) · Rahul Narava, Siddharth Verma, Ojas Jain, Shashi Shekhar Jha, Mayank Shekhar Jha ·

    CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

    arXiv:2604.23576v1 Announce Type: new Abstract: Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to s…

  738. arXiv cs.LG TIER_1 English(EN) · Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng ·

    TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-Turn Autonomous Agents

    arXiv:2604.24005v1 Announce Type: new Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent se…

  739. arXiv cs.LG TIER_1 English(EN) · Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Seyyid Osman Sevgili, Ümit Can Bekar ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    arXiv:2604.24338v1 Announce Type: new Abstract: This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A m…

  740. arXiv cs.LG TIER_1 English(EN) · Ying-Tu Chen, Wei Hung, Bing-Shu Wu, Zhang-Wei Hong, Ping-Chun Hsieh ·

    A Reward-Free Perspective on Multi-Objective Reinforcement Learning

    arXiv:2604.24532v1 Announce Type: new Abstract: Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach ad…

  741. arXiv cs.CL TIER_1 English(EN) · Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao ·

    DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-Based Agents

    arXiv:2604.24320v1 Announce Type: new Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental und…

  742. arXiv cs.LG TIER_1 English(EN) · Shipeng Li, Zhiqin Yang, Shikun Li, Xiaobo Xia, Hengyu Liu, Xinghua Zhang, Gaode Chen, Dong Fang, Ying Tai, Zhe Peng ·

    LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment

    arXiv:2506.11480v4 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we p…

  743. arXiv cs.LG TIER_1 English(EN) · Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh ·

    Polychromic Objectives for Reinforcement Learning

    arXiv:2509.25424v5 Announce Type: replace Abstract: Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising b…

  744. arXiv cs.AI TIER_1 English(EN) · Donghwan Lee ·

    Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

    arXiv:2604.17457v3 Announce Type: replace-cross Abstract: Q-value iteration (Q-VI) is usually analyzed through the \(\gamma\)-contraction of the Bellman operator. This argument proves convergence to \(Q^*\), but it gives only a coarse account of when the induced greedy policy bec…

  745. arXiv cs.LG TIER_1 English(EN) · Wenchao Li ·

    SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across …

  746. arXiv cs.LG TIER_1 English(EN) · Ping-Chun Hsieh ·

    A Reward-Free Perspective on Multi-Objective Reinforcement Learning

    Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network…

  747. Hugging Face Daily Papers TIER_1 English(EN) ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulat…

  748. arXiv cs.LG TIER_1 English(EN) · Ümit Can Bekar ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulat…

  749. arXiv cs.CL TIER_1 English(EN) · Shen Gao ·

    DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-Based Agents

    Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single …

  750. arXiv cs.CL TIER_1 English(EN) · Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu ·

    UR$^2$: Unifying RAG and Reasoning Through Reinforcement Learning

    arXiv:2508.06165v4 Announce Type: replace Abstract: Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex …

  751. arXiv cs.LG TIER_1 English(EN) · Anne E. Staples ·

    Insect-Inspired Modular Architectures as an Inductive Bias for Reinforcement Learning

    arXiv:2604.22081v1 Announce Type: new Abstract: Most reinforcement-learning (RL) controllers used in continuous control are architecturally centralized: observations are compressed into a single latent state from which both value estimates and actions are produced. Biological con…

  752. arXiv cs.LG TIER_1 English(EN) · Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu ·

    ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

    arXiv:2604.22169v1 Announce Type: new Abstract: Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at al…

  753. arXiv cs.LG TIER_1 English(EN) · Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang ·

    Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    arXiv:2604.22229v1 Announce Type: new Abstract: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset…

  754. arXiv cs.LG TIER_1 English(EN) · Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng ·

    SOLAR-RL: Semi-Online Long-Horizon Assignment Reinforcement Learning

    arXiv:2604.22558v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GU…

  755. arXiv cs.LG TIER_1 English(EN) · Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava ·

    Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

    arXiv:2512.20831v2 Announce Type: replace-cross Abstract: Real-world sequential decision-making often involves parameterized action spaces that require both decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed.…

  756. arXiv cs.LG TIER_1 English(EN) · Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor ·

    AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning

    arXiv:2511.14135v2 Announce Type: replace Abstract: Fair workload enforcement in heterogeneous multi-agent systems that pursue shared objectives remains challenging. Fixed fairness penalties often introduce inefficiencies, training instability, and conflicting agent incentives. R…

  757. arXiv cs.AI TIER_1 English(EN) · Lingfang Zeng ·

    SOLAR-RL: Semi-Online Long-Horizon Assignment Reinforcement Learning

    As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilem…

  758. arXiv cs.AI TIER_1 English(EN) · Chi Zhang ·

    Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipe…

  759. arXiv cs.AI TIER_1 English(EN) · Yong Liu ·

    ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

    Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast lea…

  760. arXiv cs.LG TIER_1 English(EN) · Sukesh Subaharan ·

    Dynamical Priors as a Training Objective in Reinforcement Learning

    Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or …

  761. Hugging Face Daily Papers TIER_1 English(EN) ·

    Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in grou…

  762. X — Mira Murati TIER_1 English(EN) · Mira Murati ·

    Combining the benefits of RL and SFT with on-policy distillation, a promising approach for training small models for domain performance and continual learning...

    Combining the benefits of RL and SFT with on-policy distillation, a promising approach for training small models for domain performance and continual learning. Thinking Machines: Our latest post explores on-policy distillation, a training appr…

  763. arXiv stat.ML TIER_1 English(EN) · Zhiheng Zhang ·

    Wasserstein Policy Learning for Distributional Outcomes

    Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential out…

  764. arXiv stat.ML TIER_1 English(EN) · Tengyang Xie ·

    When Does Trajectory-Level Supervision Enable Efficient Offline Reinforcement Learning?

    Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first …

  765. arXiv cs.CV TIER_1 English(EN) · Mohamed Jismy Aashik Rasool, Shabir Ahmad, Gisong Oh, Teag Kuen Whangbo ·

    SPARK: Spatial Policy-Based Adaptive Reinforcement Learning Knowledge Distillation

    arXiv:2606.15243v1 Announce Type: new Abstract: Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Exist…

  766. arXiv cs.CV TIER_1 English(EN) · Shaivi Malik ·

    Reinforcement Learning for Neural Model Editing

    arXiv:2606.13461v1 Announce Type: cross Abstract: Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formula…

  767. arXiv stat.ML TIER_1 English(EN) · Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin ·

    Learning with Simulators in a Computationally Bounded World: No Regret

    arXiv:2606.13576v1 Announce Type: cross Abstract: Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, wh…

  768. arXiv cs.CV TIER_1 English(EN) · Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou ·

    From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

    arXiv:2507.22028v2 Announce Type: replace Abstract: Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to …

  769. arXiv stat.ML TIER_1 English(EN) · Alexander Rakhlin ·

    Learning with Simulators in a Computationally Bounded World: No Regret

    Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far mo…

  770. arXiv stat.ML TIER_1 English(EN) · Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini ·

    The Impact of Connectivity on Laplacian Representations in Reinforcement Learning

    arXiv:2603.08558v3 Announce Type: replace-cross Abstract: Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches l…

  771. arXiv cs.CV TIER_1 English(EN) · Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong ·

    ReMoT: Reinforcement Learning with Motion Contrast Triplets

    arXiv:2603.00461v3 Announce Type: replace Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integ…

  772. arXiv stat.ML TIER_1 English(EN) · Alexander Ryabchenko, Wenlong Mou ·

    Reinforcement Learning with Action-Triggered Observations

    arXiv:2510.02149v2 Announce Type: replace-cross Abstract: We introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, w…

  773. arXiv cs.CV TIER_1 English(EN) · Guillaume Henon-Just ·

    Toward Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforc…

  774. arXiv stat.ML TIER_1 English(EN) · Haolin Liu, Braham Snyder, Chen-Yu Wei ·

    The Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

    arXiv:2602.12107v2 Announce Type: replace-cross Abstract: We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited…

  775. arXiv stat.ML TIER_1 English(EN) · Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng ·

    ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning

    arXiv:2603.10823v2 Announce Type: replace Abstract: Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving…

  776. arXiv stat.ML TIER_1 English(EN) · Thanh Nguyen-Tang, Raman Arora ·

    Exact Unlearning in Reinforcement Learning

    arXiv:2606.04182v1 Announce Type: cross Abstract: We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output…

  777. arXiv stat.ML TIER_1 English(EN) · Harin Lee, Kevin Jamieson ·

    Minimax-Optimal Policies with Delayed Observations in Online Reinforcement Learning

    arXiv:2603.03480v2 Announce Type: replace-cross Abstract: We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper…

  778. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Exact Unlearning in Reinforcement Learning

    We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output after unlearning is indistinguishable from…

  779. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Minimax-Optimal Policy Regret in Partially Observable Markov Games

    arXiv:2606.02363v1 Announce Type: cross Abstract: We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial o…

  780. arXiv stat.ML TIER_1 English(EN) · Imad Aouali, Otmane Sakhi ·

    Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

    arXiv:2509.03456v2 Announce Type: replace Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assumin…

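  781. arXiv stat.ML TIER_1 English(EN) · Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan ·

    Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline Reinforcement Learning with Linear $q^\pi$-Realizability and Concentrability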

    arXiv:2510.03494v2 Announce Type: replace-cross Abstract: We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for eit…

  782. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Minimax-Optimal Policy Regret in Partially Observable Markov Games

    We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavi…

  783. arXiv stat.ML TIER_1 English(EN) · Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed, Michael Muehlebach ·

    Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

    arXiv:2605.31261v1 Announce Type: cross Abstract: The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by cons…

  784. arXiv stat.ML TIER_1 English(EN) · Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi ·

    Physics-Informed Goal-Conditioned Reinforcement Learning for Hybrid Contact Dynamics

    arXiv:2605.30503v1 Announce Type: cross Abstract: Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies tha…

  785. arXiv stat.ML TIER_1 English(EN) · Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang ·

    Convergence of Two-Timescale Markovian Stochastic Approximation with Applications in Reinforcement Learning

    arXiv:2605.31172v1 Announce Type: cross Abstract: This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA i…

  786. arXiv stat.ML TIER_1 English(EN) · Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata ·

    PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

    arXiv:2510.10544v3 Announce Type: replace-cross Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obt…

  787. LessWrong (AI tag) TIER_1 English(EN) · andyqhan ·

    How Are You? Reinforcement Learning Recruits a Functional Welfare Axis in Language Models

    In collaboration with David Chalmers and Pavel Izmailov. Work done at NYU. Andy wrote this summary of the paper, which you can find in full on the website (https://functionalwelfare.com), or, if you …

  788. arXiv stat.ML TIER_1 English(EN) · Michael Muehlebach ·

    Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

    The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the …

  789. arXiv stat.ML TIER_1 English(EN) · Shangtong Zhang ·

    Convergence of Two-Timescale Markovian Stochastic Approximation with Applications in Reinforcement Learning

    This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal dif…

  790. arXiv stat.ML TIER_1 English(EN) · Christoph Dann, Yishay Mansour, Mehryar Mohri ·

    Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

    arXiv:2605.29032v1 Announce Type: cross Abstract: Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a real…

  791. arXiv stat.ML TIER_1 English(EN) · Dorival Leão, Alberto Ohashi, Simone Scotti, Adolfo M. D da Silva ·

    Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control: Complete Version

    arXiv:2604.13147v2 Announce Type: replace Abstract: This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equat…

  792. arXiv stat.ML TIER_1 English(EN) · Ahmed H. Qureshi ·

    Physics-Informed Goal-Conditioned Reinforcement Learning for Hybrid Contact Dynamics

    Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization…

  793. arXiv stat.ML TIER_1 English(EN) · Wonyoung Kim, Min-Hwan Oh, Garud Iyengar, Assaf Zeevi ·

    Variance-Adaptive Optimal Algorithms for Reinforcement Learning with Multinomial Logit Function Approximation

    arXiv:2605.28364v1 Announce Type: new Abstract: Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-ca…

  794. arXiv stat.ML TIER_1 English(EN) · Sebastian Sanokowski, Kaustubh Patil ·

    Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

    arXiv:2512.02019v3 Announce Type: replace-cross Abstract: Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajecto…

  795. arXiv stat.ML TIER_1 English(EN) · Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus ·

    Reward Transfer for Inverse Reinforcement Learning: A Coupled Minimax Approach

    arXiv:2605.27834v1 Announce Type: cross Abstract: We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are c…

  796. arXiv stat.ML TIER_1 English(EN) · Mohammadmahdi Ghasemloo, David J. Eckman, Yaxian Li ·

    Accelerating Reinforcement Learning Training with Simulation Surrogate Models

    arXiv:2605.27556v1 Announce Type: new Abstract: High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output rela…

  797. arXiv stat.ML TIER_1 English(EN) · Mehryar Mohri ·

    Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

    Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but f…

  798. arXiv stat.ML TIER_1 English(EN) · Assaf Zeevi ·

    Variance-Adaptive Optimal Algorithms for Reinforcement Learning with Multinomial Logit Function Approximation

    Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance…

  799. arXiv stat.ML TIER_1 English(EN) · Shengbo Wang, Jose Blanchet, Peter Glynn ·

    Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

    arXiv:2605.26361v1 Announce Type: cross Abstract: Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We …

  800. arXiv stat.ML TIER_1 English(EN) · Nathan Kallus ·

    Reward Transfer for Inverse Reinforcement Learning: A Coupled Minimax Approach

    We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate…

  801. arXiv stat.ML TIER_1 English(EN) · Yaxian Li ·

    Accelerating Reinforcement Learning Training with Simulation Surrogate Models

    High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output relationship. In parallel, reinforcement learning (R…

  802. arXiv stat.ML TIER_1 English(EN) · Peter Glynn ·

    Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

    Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We study value-based policy learning in stochastic op…

  803. arXiv stat.ML TIER_1 English(EN) · Chengchun Shi ·

    Counterfactual Safe Reinforcement Learning

    Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the noti…

  804. arXiv stat.ML TIER_1 English(EN) · Taiji Suzuki ·

    How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

    Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with…

  805. arXiv cs.CV TIER_1 English(EN) · Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing ·

    ParaVT: Resolving the Tool-Prior Paradox of Parallel Tool Use in Agentic Video Reinforcement Learning

    arXiv:2605.20342v2 Announce Type: replace Abstract: Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dis…

  806. arXiv stat.ML TIER_1 English(EN) · Jongchan Park ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    arXiv:2605.21557v1 Announce Type: new Abstract: Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL) - beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation d…

  807. arXiv stat.ML TIER_1 English(EN) · Oliver Mortensen, Mohammad Sadegh Talebi ·

    On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

    arXiv:2605.21763v1 Announce Type: cross Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which…

  808. arXiv stat.ML TIER_1 English(EN) · Mohammad Sadegh Talebi ·

    On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

    We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic…

  809. arXiv stat.ML TIER_1 English(EN) · Jongchan Park ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL) - beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data …

  810. arXiv stat.ML TIER_1 English(EN) · Zijun Chen, Zihan Zhang ·

    Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

    arXiv:2605.15692v1 Announce Type: cross Abstract: We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret ag…

  811. arXiv stat.ML TIER_1 English(EN) · Maryam Kamgarpour ·

    Fast Convergence of Inverse Reinforcement Learning

    We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimat…

  812. arXiv stat.ML TIER_1 English(EN) · Ian Osband ·

    Delightful Distributed Policy Gradient

    arXiv:2603.20521v2 Announce Type: replace-cross Abstract: Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising …

  813. arXiv stat.ML TIER_1 English(EN) · Tobias Schmähling, Matthias Burkhardt, Tobias Windisch ·

    Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    arXiv:2605.13401v1 Announce Type: cross Abstract: We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajector…

  814. arXiv stat.ML TIER_1 English(EN) · Yash Kanoria ·

    Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Hybrid Gradients

    We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard mo…

  815. arXiv stat.ML TIER_1 English(EN) · Tobias Windisch ·

    Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation …

  816. arXiv stat.ML TIER_1 English(EN) · Maxime Haddouche, Otmane Sakhi ·

    Sequential Off-Policy Learning with Logarithmic Smoothing

    arXiv:2506.10664v2 Announce Type: replace Abstract: Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, polici…

  817. arXiv stat.ML TIER_1 English(EN) · Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli ·

    DARLING: Detection-Augmented Reinforcement Learning with Non-Stationary Guarantees

    arXiv:2604.16684v2 Announce Type: replace-cross Abstract: We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise stationary (PS) setting,…

  818. arXiv stat.ML TIER_1 English(EN) · Nam Phuong Tran, Andi Nika, Goran Radanovic, Long Tran-Thanh, Debmalya Mandal ·

    Sparse Offline Reinforcement Learning with Corruption Robustness

    arXiv:2512.24768v3 Announce Type: replace Abstract: We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse …

  819. arXiv stat.ML TIER_1 English(EN) · Aidan Gleich, Eric Laber, Alexander Volfovsky ·

    Adaptive Policy Learning Under Unknown Network Interference

    arXiv:2605.11191v1 Announce Type: new Abstract: Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in orde…

  820. arXiv stat.ML TIER_1 English(EN) · Seokmin Ko, Ambuj Tewari, Kihyuk Hong ·

    Offline Constrained Reinforcement Learning Under Partial Data Coverage

    arXiv:2505.17506v2 Announce Type: replace Abstract: We study offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. Prior methods either require full data coverage for evaluating intermediate policies, l…

  821. arXiv stat.ML TIER_1 English(EN) · Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao ·

    TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    arXiv:2605.11473v1 Announce Type: cross Abstract: Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diag…

  822. arXiv stat.ML TIER_1 English(EN) · Rui Miao ·

    TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously ov…

  823. arXiv stat.ML TIER_1 English(EN) · Alexander Volfovsky ·

    Adaptive Policy Learning Under Unknown Network Interference

    Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in order to maximize a cumulative outcome of interest (…

  824. arXiv stat.ML TIER_1 English(EN) · Guannan Qu ·

    Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-Step Policy Gradients

    This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improv…

  825. arXiv stat.ML TIER_1 English(EN) · Zaiwei Chen ·

    Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman Operator Framework

    In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in whi…

  826. arXiv stat.ML TIER_1 English(EN) · Lars van der Laan, Nathan Kallus, Aurelien Bibaut ·

    Inverse Reinforcement Learning with Classification and a Few Regressions

    arXiv:2509.21172v2 Announce Type: replace-cross Abstract: Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward--value pairs can rationalize the same actions. Meaningful reward recovery…

  827. arXiv stat.ML TIER_1 English(EN) · Xinyu Liu, Zixuan Xie, Shangtong Zhang ·

    Almost Sure Convergence Rates for Stochastic Approximation and Reinforcement Learning via Poisson-Moreau Drift

    arXiv:2605.07104v1 Announce Type: cross Abstract: Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic app…

  828. arXiv stat.ML TIER_1 English(EN) · Yuyang Zhang, Haldun Balim, Na Li ·

    Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.07101v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), address…

  829. arXiv stat.ML TIER_1 English(EN) · Kun Long, Yuqiang Li, Xianyi Wu ·

    Improved Model-Based Reinforcement Learning with Smoothing Kernels

    arXiv:2605.07218v1 Announce Type: cross Abstract: For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive …

  830. arXiv stat.ML TIER_1 English(EN) · Lars van der Laan, Nathan Kallus ·

    Bellman Calibration for $V$-Learning in Offline Reinforcement Learning

    arXiv:2512.23694v2 Announce Type: replace Abstract: Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Be…

  831. LessWrong (AI tag) TIER_1 English(EN) · Oliver Sourbut ·

    Scaling Reinforcement Learning May Incentivize Hidden-Reasoning Architectures in AI

    In short: the transformer architecture brought massive scale to AI, and also provided partial guarantees of ‘reasoning out loud’, an unprecedentedly interpretable situation for AI. Reinforcement learning (…

  832. arXiv stat.ML TIER_1 English(EN) · Feng Ji ·

    Reinforcement Learning Measurement Model

    Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action val…

  833. arXiv stat.ML TIER_1 English(EN) · Xianyi Wu ·

    Improved Model-Based Reinforcement Learning with Smoothing Kernels

    For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-bas…

  834. arXiv stat.ML TIER_1 English(EN) · Shangtong Zhang ·

    Almost Sure Convergence Rates for Stochastic Approximation and Reinforcement Learning via Poisson-Moreau Drift

    Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are c…

  835. arXiv stat.ML TIER_1 English(EN) · Na Li ·

    Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-Agent Reinforcement Learning

    Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In pr…

  836. arXiv stat.ML TIER_1 English(EN) · Lifeng Lai ·

    Transformers Provably Implement In-Context Reinforcement Learning via Policy Improvement

    We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement p…

  837. arXiv stat.ML TIER_1 English(EN) · Li Song ·

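    Maximizing Rollout Informativeness Under a Fixed Budget: A Submodular Tree-Search Perspective on Tool-Use Agentic Reinforcement Learning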

    We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bo…

  838. arXiv stat.ML TIER_1 English(EN) · Onno Eberhard, Thibaut Cuvelier, Michal Valko, Bruno De Backer ·

    Middle-Mile Logistics Through the Lens of Goal-Conditioned Reinforcement Learning

    arXiv:2605.02461v1 Announce Type: new Abstract: Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with …

  839. arXiv stat.ML TIER_1 English(EN) · Bruno De Backer ·

    Middle-Mile Logistics Through the Lens of Goal-Conditioned Reinforcement Learning

    Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs f…

  840. arXiv stat.ML TIER_1 English(EN) · Tiangang Zhang ·

    Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

    For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in…

  841. arXiv stat.ML TIER_1 English(EN) · Rohan Tangri, Jan-Peter Calliess ·

    Value-at-Risk Constrained Policy Optimization via the Cantelli Bound

    arXiv:2601.22993v3 Announce Type: replace-cross Abstract: We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empi…

  842. arXiv stat.ML TIER_1 English(EN) · Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang ·

    DDO-RM: Distribution-Level Policy Improvement After Reward Learning

    arXiv:2604.11119v2 Announce Type: replace Abstract: Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate deci…

  843. arXiv stat.ML TIER_1 English(EN) · Ruqi Zhang ·

    Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degra…

  844. arXiv stat.ML TIER_1 English(EN) · Jose Blanchet ·

    Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem u…

  845. arXiv cs.CV TIER_1 English(EN) · Chengwei Qin ·

    PRISM: Pre-Alignment for Multimodal Reinforcement Learning via Black-Box On-Policy Distillation

    The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's o…

  846. arXiv stat.ML TIER_1 English(EN) · Zhenghao Li, Shengbo Wang, Nian Si ·

    Near-Optimal Sample Complexity for S-Rectangular Distributionally Robust Reinforcement Learning Under Divergence Constraints

    arXiv:2505.12202v3 Announce Type: replace-cross Abstract: Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conse…

  847. arXiv stat.ML TIER_1 English(EN) · Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin ·

    When Errors Help: A Taxonomy of Imperfect Rewards for Policy Gradients

    arXiv:2604.25872v1 Announce Type: cross Abstract: Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality o…

  848. arXiv stat.ML TIER_1 English(EN) · Noam Razin ·

    When Errors Help: A Taxonomy of Imperfect Rewards for Policy Gradients

    Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat i…

  849. arXiv stat.ML TIER_1 English(EN) · Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong ·

    CODA: Policy-Based Diffusion Coordination for Multi-Agent Offline Reinforcement Learning

    arXiv:2604.23308v1 Announce Type: cross Abstract: Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they ca…

  850. arXiv stat.ML TIER_1 English(EN) · Elliot Fosong ·

    CODA: Policy-Based Diffusion Coordination for Multi-Agent Offline Reinforcement Learning

    Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introdu…

  851. Smol AINews TIER_1 English(EN) ·

    Prime Intellect's INTELLECT-2 and PRIME-RL Advance Distributed Reinforcement Learning

    **Prime Intellect** released **INTELLECT-2**, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. **ByteDance** launched **DreamO**, a unified image customization model on Hugging Face. **Qwen** released models opt…

  852. Smol AINews TIER_1 English(EN) ·

    PRIME: Process Reinforcement Through Implicit Rewards

    **Implicit Process Reward Models (PRIME)** have been highlighted as a significant advancement in online reinforcement learning, trained on a **7B model** with impressive results compared to **gpt-4o**. The approach builds on the importance of process reward models established by …

  853. Eugene Yan TIER_1 English(EN) ·

    Reinforcement Learning for Recommendations and Search

    Focusing on long-term rewards, exploration, and frequently updated items.

  854. Modal blog TIER_1 English(EN) ·

    Reinforcement Learning Is an Infrastructure Problem

    What we've seen helping teams run Reinforcement Learning at scale on Modal. Plus an open-source library to skip the scaffolding.

  855. Modal blog TIER_1 English(EN) ·

    Scaling Reinforcement Learning at Applied Compute

    How Applied Compute trains custom agents with Reinforcement Learning for enterprises like DoorDash, Cognition, and Mercor on Modal.

  856. AWS Machine Learning Blog TIER_1 English(EN) · Surya Kari ·

    Overcoming Reward Signal Challenges: Reinforcement Learning with Verifiable Rewards Using GRPO on SageMaker AI

    In this post, you will learn how to implement reinforcement learning with verifiable rewards (RLVR) to introduce verification and transparency into reward signals to improve training performance. This approach works best when outputs can be objectively verified for correctness, s…

  857. Together AI blog TIER_1 English(EN) ·

    Together AI Partners with Meta to Bring PyTorch Reinforcement Learning to the AI Native Cloud

    Build, train, and deploy advanced AI agents with integrated reinforcement learning on the Together platform.

  858. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Exploring Deep Reinforcement Learning

    In addition to being a Developer Advocate at Hugging Face, Thomas Simonini is building next-gen AI in games that can talk and have smart interactions with the player using Deep Reinforcement Learning (DRL) and Natural Language Processing (NLP). He also created a Deep Reinforce…

  859. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Reinforcement Learning for Search

    Hamish from Sajari blows our mind with a great discussion about AI in search. In particular, he talks about Sajari’s quest for performant AI implementations and extensive use of Reinforcement Learning (RL). We’ve been wanting to make this one happen for a while, and it was wel…

  860. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Reinforcement Learning for Chip Design

    Daniel and Chris have a fascinating discussion with Anna Goldie and Azalia Mirhoseini from Google Brain about the use of reinforcement learning for chip floor planning - or placement - in which many new designs are generated, and then evaluated, to find an optimal component la…

  861. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Deep Reinforcement Learning

    While attending the NVIDIA GPU Technology Conference in Silicon Valley, Chris met up with Adam Stooke, a speaker and PhD student at UC Berkeley who is doing groundbreaking work in large-scale deep reinforcement learning and robotics. Adam took Chris on a tour of deep reinforce…

  862. Lex Fridman Podcast TIER_1 English(EN) · Lex Fridman ·

    Leslie Kaelbling: Reinforcement Learning, Planning, and Robotics

    Leslie Kaelbling is a roboticist and professor at MIT. She is recognized for her work in reinforcement learning, planning, robot navigation, and several other topics in AI. She won the IJCAI Computers and Thought Award and was the editor-in-chief of the prestigious Journal of …

  863. Lex Fridman Podcast TIER_1 Nederlands(NL) · Lex Fridman ·

    Pieter Abbeel: Deep Reinforcement Learning

    Pieter Abbeel is a professor at UC Berkeley, director of the Berkeley Robot Learning Lab, and is one of the top researchers in the world working on how to make robots understand and interact with the world around them, especially through imitation and deep reinforcement learni…

  864. Medium — Claude tag TIER_1 English(EN) · Thirupathi Pavan Sai ·

    How Machines Learn: Supervised, Unsupervised, and Reinforcement Learning

    https://medium.com/@thirupathipavansai/how-machines-learn-supervised-unsupervised-reinforcement-learning-2f8a5ae8961d?source=rss------claude-5

  865. Medium — Claude tag TIER_1 English(EN) · Abhishekrout ·

    The Secret Behind ChatGPT: Bitter Lesson + Core Loop (RL, Reinforcement Learning)

    https://medium.com/@abhishekrout77/the-secret-behind-chatgpt-bitter-lesson-core-loop-rl-reinforcement-learning-40cf97d7104c?source=rss------claude-5

  866. Towards AI TIER_1 English(EN) · Deepanshu Gupta ·

    Reinforcement Learning: The Post-Training Engine Behind Reasoning Models

    Reinforcement learning used to feel like a branch of AI reserved for games, robotics, recommendation systems, and control. It was the world of agents, environments, rewards, policies, simulators, self-play, exploration, and long-horizon decisions. The defining question…

  867. Mastodon — mastodon.social TIER_1 日本語(JA) · ymbot ·

    An In-Depth Look at Agentic Reinforcement Learning in GPT-OSS: A Practical Retrospective https://huggingface.co/blog/LinkedIn/gpt-oss-agentic-rl *AI-generated automatic post (headline + link) #AI #GenerativeAI #LLM #AIGenerated

    [Demystifying Agentic Reinforcement Learning in GPT-OSS: A Practical Retrospective] https://huggingface.co/blog/LinkedIn/gpt-oss-agentic-rl *AI-generated automatic post (headline + link) #AI #GenerativeAI #LLM #AIGenerated

  868. Mastodon — mastodon.social TIER_1 English(EN) · jonathannnnn ·

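    A look at how reinforcement learning can lead to “reward hacking,” where AI finds shortcuts to maximize rewards without truly achieving the intended goal. It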

    A look at how reinforcement learning can lead to “reward hacking,” where AI finds shortcuts to maximize rewards without truly achieving the intended goal. It highlights how reward design shapes AI behavior. #AI #MachineLearning #AIsafety Read more: https://solihullpublishing.…

  869. Mastodon — mastodon.social TIER_1 English(EN) · aihaberleri ·

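    📰 2026 Breakthrough: OpenAI Eliminates Parameter Updates in Reinforcement Learning with Python Scripts A groundbreaking reinforcement learning paradigm developed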

    📰 2026 Breakthrough: OpenAI Eliminates Parameter Updates in Reinforcement Learning with Python Scripts A groundbreaking reinforcement learning paradigm developed by OpenAI researcher Jia-Yi Weng eliminates the need for parameter updates, enabling AI agents to make decisions by ge…

  870. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

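    📰 New Learning Method: Reinforcement Learning Without Parameter Updates. OpenAI researchers, AI decides on its own without updating parameters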

    📰 New Learning Method: Reinforcement Learning Without Parameter Updates. OpenAI researchers have presented a new reinforcement learning paradigm that lets AI make decisions on its own without updating parameters. This method lets the AI learn by writing a .py file.…