PulseAugur

OpenAI advances reinforcement learning with Dota 2, safety, and generalization

OpenAI has published a series of research papers detailing advancements in reinforcement learning. These include achieving superhuman performance in Dota 2 with OpenAI Five, developing benchmarks for safe exploration in RL, and quantifying generalization capabilities with the CoinRun environment. The company also explored novel methods like prediction-based rewards for curiosity-driven exploration, learning policy representations in multiagent systems, and an experimental metalearning approach called Evolved Policy Gradients for faster training on new tasks. Further research addresses variance reduction in policy gradients and the equivalence between policy gradients and soft Q-learning, alongside challenging robotics environments for multi-goal RL.

IMPACT Demonstrates progress in RL on several fronts: superhuman game play, safe exploration, generalization, and curiosity-driven exploration.

RANK_REASON Multiple research papers published by OpenAI on various aspects of reinforcement learning.

AI-generated summary · Google Gemini · from 870 sources.

COVERAGE [870]

  1. OpenAI News TIER_1 English(EN) ·

    Dota 2 with large scale deep reinforcement learning

  2. OpenAI News TIER_1 English(EN) ·

    Benchmarking safe exploration in deep reinforcement learning

  3. OpenAI News TIER_1 English(EN) ·

    Quantifying generalization in reinforcement learning

    We’re releasing CoinRun, a training environment which provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the env…

  4. OpenAI News TIER_1 English(EN) ·

    Reinforcement learning with prediction-based rewards

    We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first time exceeds average human performance on Montezuma’s Revenge.

  5. OpenAI News TIER_1 English(EN) ·

    Learning policy representations in multiagent systems

  6. OpenAI News TIER_1 English(EN) ·

    Evolved Policy Gradients

    We’re releasing an experimental metalearning approach called Evolved Policy Gradients, a method that evolves the loss function of learning agents, which can enable fast training on novel tasks. Agents trained with EPG can succeed at basic tasks at test time that were outside thei…

  7. OpenAI News TIER_1 English(EN) ·

    Variance reduction for policy gradient with action-dependent factorized baselines

  8. OpenAI News TIER_1 English(EN) ·

    Some considerations on learning to explore via meta-reinforcement learning

  9. OpenAI News TIER_1 English(EN) ·

    Multi-Goal Reinforcement Learning: Challenging robotics environments and request for research

  10. OpenAI News TIER_1 English(EN) ·

    Equivalence between policy gradients and soft Q-learning

  11. OpenAI News TIER_1 English(EN) ·

    Stochastic Neural Networks for hierarchical reinforcement learning

  12. OpenAI News TIER_1 English(EN) ·

    #Exploration: A study of count-based exploration for deep reinforcement learning

  13. OpenAI News TIER_1 English(EN) ·

    RL²: Fast reinforcement learning via slow reinforcement learning

  14. Apple Machine Learning Research TIER_1 English(EN) ·

    PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

    Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents using outcome-only rewards suffers from credit-assignment ambiguity, obscuring which…

  15. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Reward Hacking in Reinforcement Learning

    Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities i…

  16. Hugging Face Blog TIER_1 English(EN) ·

    Introducing ⚔️ AI vs. AI ⚔️ a deep reinforcement learning multi-agents competition system

  17. Hugging Face Blog TIER_1 English(EN) ·

    Illustrating Reinforcement Learning from Human Feedback (RLHF)

  18. Hugging Face Blog TIER_1 English(EN) ·

    An Introduction to Deep Reinforcement Learning

  19. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Exploration Strategies in Deep Reinforcement Learning

    Exploitation versus exploration is a critical topic in reinforcement learning. This post introduces several common approaches for better exploration in Deep RL. [Updated on 2020-06-17: Add “exploration…

  20. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Curriculum for Reinforcement Learning

    A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help r…

  21. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Meta Reinforcement Learning

    Meta-RL is meta-learning on reinforcement learning tasks. After being trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into t…

  22. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym

    Let's see how to implement a number of classic deep reinforcement learning models in code. The full implementation is available in lilianweng/deep-reinforcement-learning-gym (https://github.com/lilianweng/deep-reinforcement-learning-gym). In the prev…

  23. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    Policy Gradient Algorithms

    Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, …

  24. Lil'Log (Lilian Weng) TIER_1 English(EN) ·

    A (Long) Peek into Reinforcement Learning

    In this post, we are gonna briefly go over the field of Reinforcement Learning (RL), from fundamental concepts to classic algorithms. Hopefully, this review is helpful enough so that newbies would not get lost in specialized terms and jargons while starting. [WARNING] This i…

  25. Andrej Karpathy TIER_1 English(EN) · Andrej Karpathy ·

    Pong AI with Policy Gradients

    Trained for ~8000 episodes, each episode = ~30 games. Updates were done in batches of 10 episodes, so ~800 updates total. Policy network is a 2-layer neural net connected to raw pixels, with 200 hidden units. Trained with RMSProp and learning rate 1e-4. The final agent does not b…

  26. arXiv cs.LG TIER_1 English(EN) · Hsiao-Ru Pan, Bernhard Schölkopf ·

    Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

    arXiv:2606.20411v1 Announce Type: new Abstract: Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and…

  27. arXiv cs.AI TIER_1 English(EN) · Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack ·

    Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

    arXiv:2606.19357v1 Announce Type: cross Abstract: We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroll…

  28. arXiv cs.AI TIER_1 English(EN) · Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang ·

    MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

    arXiv:2506.14990v3 Announce Type: replace Abstract: Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 se…

  29. arXiv cs.AI TIER_1 English(EN) · ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim ·

    MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

    arXiv:2510.18383v3 Announce Type: replace-cross Abstract: Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor…

  30. arXiv cs.LG TIER_1 English(EN) · Bernhard Schölkopf ·

    Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

    Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and its requirement to model transition probabiliti…

  31. arXiv cs.AI TIER_1 English(EN) · Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen ·

    Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

    arXiv:2606.18820v1 Announce Type: cross Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoff…

  32. arXiv cs.AI TIER_1 English(EN) · Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang ·

    TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

    arXiv:2606.18308v1 Announce Type: cross Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these…

  33. arXiv cs.AI TIER_1 English(EN) · Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas ·

    Self-CTRL: Self-Consistency Training with Reinforcement Learning

    arXiv:2606.18327v1 Announce Type: cross Abstract: Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that …

  34. arXiv cs.AI TIER_1 English(EN) · Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    arXiv:2606.18810v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routi…

  35. arXiv cs.LG TIER_1 English(EN) · Yiyan Huang, Cheuk Hang Leung, Qi Wu, Zhiheng Zhang ·

    Wasserstein Policy Learning for Distributional Outcomes

    arXiv:2606.19117v1 Announce Type: cross Abstract: Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare …

  36. arXiv cs.LG TIER_1 English(EN) · Xuanfei Ren, Tengyang Xie ·

    When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    arXiv:2606.18531v1 Announce Type: cross Abstract: Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimizat…

  37. arXiv cs.CL TIER_1 English(EN) · Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    arXiv:2606.18902v1 Announce Type: new Abstract: Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (AP…

  38. arXiv cs.AI TIER_1 English(EN) · Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart ·

    UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

    arXiv:2606.19328v1 Announce Type: cross Abstract: Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suff…

  39. arXiv cs.AI TIER_1 English(EN) · Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao ·

    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

    arXiv:2606.18831v1 Announce Type: cross Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as …

  40. arXiv cs.AI TIER_1 English(EN) · Nicholas Rhinehart ·

    UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

    Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during …

  41. arXiv cs.CL TIER_1 English(EN) · Jinghong Chen ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stocha…

  42. Hugging Face Daily Papers TIER_1 English(EN) ·

    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

    Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stocha…

  43. arXiv cs.AI TIER_1 English(EN) · Chaojun Xiao ·

    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

    Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, ye…

  44. arXiv cs.AI TIER_1 English(EN) · Xuebin Chen ·

    Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

    Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard …

  45. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning …

  46. arXiv cs.AI TIER_1 English(EN) · Heyan Huang ·

    Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning …

  47. arXiv cs.AI TIER_1 English(EN) · Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu ·

    Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

    arXiv:2606.17735v1 Announce Type: new Abstract: Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations int…

  48. arXiv cs.AI TIER_1 English(EN) · Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He ·

    Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

    arXiv:2606.17591v1 Announce Type: new Abstract: Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and in…

  49. arXiv cs.LG TIER_1 English(EN) · Andreas Athanasopoulos, Christos Dimitrakakis ·

    Learning in Matching Games with Bandit Feedback

    arXiv:2506.03802v2 Announce Type: replace Abstract: We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with init…

  50. arXiv cs.LG TIER_1 English(EN) · Steve Halley, Maurício Gruppi ·

    Deep Reinforcement Learning for Minimum Zero-Forcing Sets

    arXiv:2606.18106v1 Announce Type: new Abstract: This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem w…

  51. arXiv cs.LG TIER_1 English(EN) · Cosmin Borsa, Michael Ludkovski ·

    Continuous-time Optimal Stopping through Deep Reinforcement Learning

    arXiv:2606.17545v1 Announce Type: new Abstract: Simulation based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal e…

  52. arXiv cs.CL TIER_1 English(EN) · Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li ·

    EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

    arXiv:2606.17680v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuit…

  53. arXiv cs.CL TIER_1 English(EN) · Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    arXiv:2606.18195v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. T…

  54. arXiv cs.AI TIER_1 English(EN) · Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng ·

    When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    arXiv:2605.05172v2 Announce Type: replace-cross Abstract: Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online lea…

  55. arXiv cs.AI TIER_1 English(EN) · Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll ·

    Knowledge Reutilization in Meta-Reinforcement Learning

    arXiv:2606.18132v1 Announce Type: new Abstract: Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-param…

  56. arXiv cs.CL TIER_1 English(EN) · Shiwei Liu ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-ri…

  57. arXiv cs.AI TIER_1 English(EN) · Alois Knoll ·

    Knowledge Reutilization in Meta-Reinforcement Learning

    Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, …

  58. arXiv cs.LG TIER_1 English(EN) · Maurício Gruppi ·

    Deep Reinforcement Learning for Minimum Zero-Forcing Sets

    This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propag…

  59. arXiv cs.CL TIER_1 English(EN) · Juanzi Li ·

    EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

    Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamic…

  60. arXiv cs.LG TIER_1 English(EN) · Michael Ludkovski ·

    Continuous-time Optimal Stopping through Deep Reinforcement Learning

    Simulation based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, app…

  61. arXiv cs.LG TIER_1 English(EN) · Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal ·

    Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

    arXiv:2606.16729v1 Announce Type: new Abstract: While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assump…

  62. arXiv cs.CL TIER_1 English(EN) · Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi ·

    Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning

    arXiv:2510.17431v2 Announce Type: replace Abstract: Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find …

  63. arXiv cs.AI TIER_1 English(EN) · Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues ·

    When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

    arXiv:2606.16995v1 Announce Type: new Abstract: Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with…

  64. arXiv cs.AI TIER_1 English(EN) · Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang ·

    StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

    arXiv:2606.15197v1 Announce Type: cross Abstract: Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or …

  65. arXiv cs.AI TIER_1 English(EN) · Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang ·

    On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

    arXiv:2606.15912v1 Announce Type: cross Abstract: Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in prac…

  66. arXiv cs.AI TIER_1 English(EN) · Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra ·

    Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

    arXiv:2606.16515v1 Announce Type: cross Abstract: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor …

  67. arXiv cs.AI TIER_1 English(EN) · Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet ·

    A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

    arXiv:2606.16933v1 Announce Type: cross Abstract: Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between traini…

  68. arXiv cs.AI TIER_1 English(EN) · Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause ·

    Safe Exploration via Policy Priors

    arXiv:2601.19612v3 Announce Type: replace-cross Abstract: Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet co…

  69. arXiv cs.LG TIER_1 English(EN) · Timo Brand, Henry Förster, Stephen Kobourov, Daniel Kohrt, Robin Schukrafft, Markus Wallinger, Johannes Zink ·

    Using Reinforcement Learning to Optimize the Global and Local Crossing Number

    arXiv:2509.06108v2 Announce Type: replace-cross Abstract: Graph drawing concerns the algorithmic visualization of graphs. A good drawing of a graph is easy to read and facilitates solving tasks on the graph. Several properties have been identified to occur in good drawings of gra…

  70. arXiv cs.LG TIER_1 English(EN) · Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai ·

    FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

    arXiv:2603.27450v2 Announce Type: replace Abstract: Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due …

  71. arXiv cs.LG TIER_1 English(EN) · Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach ·

    On the Role of Computation in Reinforcement Learning

    arXiv:2602.05999v3 Announce Type: replace Abstract: How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters, still benefit from additional compute? The standard RL framework does not pro…

  72. arXiv cs.LG TIER_1 English(EN) · Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler, Arno Solin, Juho Kannala, Joni Pajarinen ·

    Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

    arXiv:2502.19544v3 Announce Type: replace Abstract: Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that …

  73. arXiv cs.LG TIER_1 English(EN) · Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız ·

    Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

    arXiv:2606.16759v1 Announce Type: new Abstract: We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unkn…

  74. arXiv cs.LG TIER_1 English(EN) · Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto ·

    Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

    arXiv:2606.16236v1 Announce Type: new Abstract: Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but requir…

  75. Hugging Face Daily Papers TIER_1 English(EN) ·

    When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.

  76. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning from the Self-future: On-policy Self-distillation for dLLMs

    d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models.

  77. arXiv cs.AI TIER_1 English(EN) · Odinaldo Rodrigues ·

    When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

    Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM)…

  78. arXiv cs.AI TIER_1 English(EN) · Jean-Philippe Diguet ·

    A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

    Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and …

  79. arXiv cs.LG TIER_1 English(EN) · Can Deha Karıksız ·

    Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

    We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy …

  80. arXiv cs.LG TIER_1 English(EN) · Vaneet Aggarwal ·

    Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

    While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generati…

  81. arXiv cs.AI TIER_1 English(EN) · Aritra Hazra ·

    Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

    Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically …

  82. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Youhei Akimoto ·

    Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

    Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and fu…

  83. arXiv cs.AI TIER_1 English(EN) · Ayoub Belouadah, Sylvain Kubler, Yves Le Traon ·

    CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

    arXiv:2606.14415v1 Announce Type: new Abstract: Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they of…

  84. arXiv cs.LG TIER_1 English(EN) · Kai S. Yun, Zeyang Li, Navid Azizan ·

    Provably Safe, Yet Scalable Reinforcement Learning

    arXiv:2606.14536v1 Announce Type: new Abstract: Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provi…

  85. arXiv cs.LG TIER_1 English(EN) · Omar Adalat, Edwin Hamel-De le Court, Francesco Belardinelli ·

    Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

    arXiv:2606.14130v1 Announce Type: new Abstract: Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentrali…

  86. arXiv cs.AI TIER_1 English(EN) · Kai Fukazawa, Kunal Mundada, Iman Soltani ·

    RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

    arXiv:2510.02695v3 Announce Type: replace-cross Abstract: In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse…

  87. arXiv cs.AI TIER_1 English(EN) · Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li ·

    Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

    arXiv:2606.14375v1 Announce Type: cross Abstract: Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control…

  88. arXiv cs.LG TIER_1 English(EN) · Navid Azizan ·

    Provably Safe, Yet Scalable Reinforcement Learning

    Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned poli…

  89. arXiv cs.AI TIER_1 English(EN) · Yves Le Traon ·

    CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

    Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, l…

  90. arXiv cs.AI TIER_1 English(EN) · Zhen Li ·

    Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

    Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more c…

  91. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Francesco Belardinelli ·

    Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

    Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but p…

  92. arXiv cs.AI TIER_1 English(EN) · Junfeng Guo, Heng Huang ·

    PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

    arXiv:2606.12896v1 Announce Type: cross Abstract: While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserve more attention and exploration. In particular, recent work has revealed that RL agents are vulnerab…

  93. arXiv cs.CL TIER_1 English(EN) · Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    arXiv:2606.13106v1 Announce Type: cross Abstract: Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) an…

  94. arXiv cs.CL TIER_1 English(EN) · Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang ·

    SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

    arXiv:2606.12908v1 Announce Type: new Abstract: Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-polic…

  95. arXiv cs.AI TIER_1 English(EN) · Mintae Kim, Koushil Sreenath ·

    WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

    arXiv:2604.08958v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically …

  96. arXiv cs.LG TIER_1 English(EN) · Shaivi Malik ·

    Reinforcement Learning for Neural Model Editing

    Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learni…

  97. arXiv cs.CL TIER_1 English(EN) · Zhijiang Guo ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is t…

  98. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Arnaud Braud ·

    $α$-fair heterogeneous agent reinforcement learning

    Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social b…

  99. arXiv cs.CL TIER_1 English(EN) · Dakuo Wang ·

    SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

    Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own e…

  100. arXiv cs.CL TIER_1 English(EN) · Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    arXiv:2606.11709v1 Announce Type: cross Abstract: On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. Howe…

  101. arXiv cs.LG TIER_1 English(EN) · Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang ·

    UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

    arXiv:2606.12372v1 Announce Type: cross Abstract: Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain interven…

  102. arXiv cs.LG TIER_1 English(EN) · Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann ·

    Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

    arXiv:2606.12334v1 Announce Type: new Abstract: High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directl…

  103. arXiv cs.LG TIER_1 English(EN) · Felix Störck, Fabian Hinder, Barbara Hammer ·

    Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

    arXiv:2606.11797v1 Announce Type: new Abstract: Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about change is provided (uncertainty) -- a behavior tha…

  104. arXiv cs.AI TIER_1 English(EN) · Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto ·

    Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

    arXiv:2603.14867v4 Announce Type: replace-cross Abstract: Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower s…

  105. arXiv cs.AI TIER_1 English(EN) · Xin Chen, Jie Zhang, Florian Tramèr ·

    Learning to Inject: Automated Prompt Injection via Reinforcement Learning

    arXiv:2602.05746v2 Announce Type: replace-cross Abstract: Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks sh…

  106. arXiv cs.AI TIER_1 English(EN) · Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang ·

    Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

    arXiv:2509.10303v2 Announce Type: replace-cross Abstract: Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated env…

  107. arXiv cs.AI TIER_1 English(EN) · Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel ·

    Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

    arXiv:2606.12251v1 Announce Type: cross Abstract: Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcem…

  108. arXiv cs.AI TIER_1 English(EN) · Frank Xiao, Mary Phuong ·

    Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

    arXiv:2606.12016v1 Announce Type: cross Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,…

  109. arXiv cs.AI TIER_1 English(EN) · Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen ·

    Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

    arXiv:2606.11634v1 Announce Type: new Abstract: The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding…

  110. Hugging Face Daily Papers TIER_1 English(EN) ·

    Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    A switchable latent reasoning framework uses explicit boundary tokens to enable trainable and interpretable latent reasoning through recurrent hidden states.

  111. arXiv cs.LG TIER_1 English(EN) · Ziwei Wang ·

    UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

    Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human correcti…

  112. arXiv cs.LG TIER_1 English(EN) · Gerhard Neumann ·

    Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

    High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a …

  113. arXiv cs.AI TIER_1 English(EN) · Bart Preneel ·

    Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

    Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradien…

  114. arXiv cs.AI TIER_1 English(EN) · Mary Phuong ·

    Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

    Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the…

  115. arXiv cs.LG TIER_1 English(EN) · Barbara Hammer ·

    Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

    Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about change is provided (uncertainty) -- a behavior that can be modeled by forgetting mechanisms. Non-s…

  116. arXiv cs.CL TIER_1 English(EN) · Lijie Wen ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from t…

  117. Hugging Face Daily Papers TIER_1 English(EN) ·

    RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from t…

  118. arXiv cs.AI TIER_1 English(EN) · Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta ·

    Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

    arXiv:2606.10705v1 Announce Type: cross Abstract: Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of proces…

  119. arXiv cs.AI TIER_1 English(EN) · Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu ·

    Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

    arXiv:2606.10346v1 Announce Type: new Abstract: Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encour…

  120. arXiv cs.AI TIER_1 English(EN) · Alessandro Trapasso, Luca Iocchi, Fabio Patrizi ·

    Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

    arXiv:2512.14617v2 Announce Type: replace-cross Abstract: Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not s…

  121. arXiv cs.AI TIER_1 English(EN) · Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier ·

    Robust Deep Reinforcement Learning Through Adversarial Attacks and Training : A Survey

    arXiv:2403.00420v3 Announce Type: replace-cross Abstract: Deep Reinforcement Learning (DRL) is a subfield of machine learning for training autonomous agents that take sequential actions across complex environments. Despite its significant performance in well-known environments, i…

  122. arXiv cs.AI TIER_1 English(EN) · Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li ·

    RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

    arXiv:2510.14828v3 Announce Type: replace Abstract: Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language …

  123. arXiv cs.AI TIER_1 English(EN) · Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    arXiv:2606.11119v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient r…

  124. arXiv cs.AI TIER_1 English(EN) · Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    arXiv:2606.11087v1 Announce Type: cross Abstract: Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the superv…

  125. arXiv cs.AI TIER_1 English(EN) · Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu ·

    Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

    arXiv:2606.10968v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens …

  126. arXiv cs.AI TIER_1 English(EN) · João Coelho, João Magalhães, Bruno Martins, Chenyan Xiong ·

    Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

    arXiv:2606.10709v1 Announce Type: cross Abstract: The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes success…

  127. arXiv cs.AI TIER_1 English(EN) · Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo ·

    Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

    arXiv:2606.10613v1 Announce Type: cross Abstract: Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to …

  128. arXiv cs.LG TIER_1 English(EN) · Auguste Lehuger, Guillaume Henon-Just ·

    Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    arXiv:2606.10611v1 Announce Type: new Abstract: Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical…

  129. arXiv cs.LG TIER_1 English(EN) · Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang ·

    Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

    arXiv:2606.10129v1 Announce Type: new Abstract: While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, o…

  130. Hugging Face Daily Papers TIER_1 English(EN) ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness.

  131. arXiv cs.CL TIER_1 English(EN) · Xiangyang Ji ·

    TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or comp…

  132. arXiv cs.LG TIER_1 English(EN) · Sergey Levine ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating the…

  133. arXiv cs.LG TIER_1 English(EN) · Wenxi Zhu ·

    Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

    Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts …

  134. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Chenyan Xiong ·

    Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

    The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-in…

  135. arXiv cs.AI TIER_1 English(EN) · Andrea Matta ·

    Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

    Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. Th…

  136. arXiv cs.AI TIER_1 English(EN) · Chang D. Yoo ·

    Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

    Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step…

  137. Hugging Face Daily Papers TIER_1 English(EN) ·

    Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforc…

  138. arXiv cs.LG TIER_1 English(EN) · Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan ·

    Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

    arXiv:2606.08779v1 Announce Type: new Abstract: Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable sub-optimum performance or even training collapses. Recent findings attribute these failures to a hidden train…

  139. arXiv cs.LG TIER_1 English(EN) · Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang ·

    Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

    arXiv:2603.10395v2 Announce Type: replace Abstract: Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. graph flow model (GFM), has emerged due to its superior performance and flexi…

  140. arXiv cs.LG TIER_1 English(EN) · Lingkai Kong, Anagha Satish, Hezi Jiang, Akseli Kangaslahti, Andrew Ma, Wenbo Chen, Mingxiao Song, Lily Xu, Milind Tambe ·

    Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

    arXiv:2601.22211v2 Announce Type: replace Abstract: Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impract…

  141. arXiv cs.LG TIER_1 English(EN) · Paulius Sasnauskas, Yiğit Yalın, Goran Radanović ·

    Robust In-Context Reinforcement Learning Under Reward Poisoning Attacks

    arXiv:2506.06891v3 Announce Type: replace Abstract: We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we …

  142. arXiv cs.LG TIER_1 English(EN) · Qinghe Gao, Artur M. Schweidtmann ·

    Deep reinforcement learning for process design: Review and perspective

    arXiv:2308.07822v2 Announce Type: replace Abstract: The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate th…

  143. arXiv cs.LG TIER_1 English(EN) · Alexander DeRieux, Walid Saad ·

    QnRL: Quantum-Native Reinforcement Learning

    arXiv:2606.08276v1 Announce Type: cross Abstract: Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these envi…

  144. arXiv cs.LG TIER_1 English(EN) · Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu ·

    Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

    arXiv:2606.09138v1 Announce Type: new Abstract: Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focu…

  145. arXiv cs.LG TIER_1 English(EN) · Jike Zhong, Yuxiang Lai, Ming Li, Yuheng Li, Wuao Liu, Behzad Dariush, Konstantinos Psounis, Shao-Yuan Lo ·

    From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

    arXiv:2606.09092v1 Announce Type: new Abstract: Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is co…

  146. arXiv cs.LG TIER_1 English(EN) · Aditya Upadhyay ·

    UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

    arXiv:2606.07592v1 Announce Type: new Abstract: Offline reinforcement learning requires careful conservatism to mitigate distribution shift, yet most existing methods apply a fixed penalty uniformly across all states regardless of local data coverage. We present UNIQ (Uncertainty…

  147. arXiv cs.AI TIER_1 English(EN) · Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen ·

    In-Context Reinforcement Learning via Communicative World Models

    arXiv:2508.06659v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their t…

  148. arXiv cs.AI TIER_1 English(EN) · Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen ·

    ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

    arXiv:2505.21457v2 Announce Type: replace-cross Abstract: Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in hum…

  149. arXiv cs.AI TIER_1 English(EN) · Shixiong Jiang, Taozheng Zhu, Fanxin Kong ·

    Safe-RULE: Safe Reinforcement UnLEarning

    arXiv:2606.09559v1 Announce Type: cross Abstract: Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline S…

  150. arXiv cs.AI TIER_1 English(EN) · Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki ·

    HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

    arXiv:2606.08610v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, …

  151. arXiv cs.AI TIER_1 English(EN) · Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng ·

    SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

    arXiv:2606.07705v1 Announce Type: cross Abstract: Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: rewa…

  152. arXiv cs.AI TIER_1 English(EN) · Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao ·

    Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

    arXiv:2606.08815v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely …

  153. arXiv cs.AI TIER_1 English(EN) · Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo ·

    Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

    arXiv:2606.08735v1 Announce Type: new Abstract: Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evalua…

  154. arXiv cs.AI TIER_1 English(EN) · Ashkan Ansarifard (Sapienza University of Rome), Matteo Mancanelli (Sapienza University of Rome), Elena Umili (Sapienza University of Rome), Fabio Patrizi (Sapienza University of Rome) ·

    Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

    arXiv:2606.08312v1 Announce Type: new Abstract: In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformer…

  155. Hugging Face Daily Papers TIER_1 English(EN) ·

    Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

    QGF is an RL algorithm that improves policies at test time by using a value gradient to guide a pre-trained flow policy, avoiding training-time instability while maintaining competitive performance.

  156. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

    CPPO addresses limitations in reinforcement learning with verifiable rewards by introducing position-weighted thresholds and cumulative prefix budgeting to better handle autoregressive generation challenges.

  157. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Nguyen Dang ·

    Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

    While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, in…

  158. arXiv cs.AI TIER_1 English(EN) · Fanxin Kong ·

    Safe-RULE: Safe Reinforcement UnLEarning

    Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversarie…

  159. arXiv cs.CL TIER_1 English(EN) · Qi Liu ·

    Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

    Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and traini…

  160. arXiv cs.AI TIER_1 English(EN) · Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi ·

    Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

    arXiv:2601.18510v2 Announce Type: replace-cross Abstract: While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but …

  161. arXiv cs.AI TIER_1 English(EN) · Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang ·

    CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

    arXiv:2601.05675v2 Announce Type: replace Abstract: Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space rema…

  162. arXiv cs.LG TIER_1 English(EN) · Haruto Tanaka, A. Rupam Mahmood ·

    Performance Variation in Deep Reinforcement Learning

    arXiv:2606.06746v1 Announce Type: new Abstract: Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum…

  163. arXiv cs.LG TIER_1 English(EN) · Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh ·

    Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

    arXiv:2606.06673v1 Announce Type: new Abstract: Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Gui…

  164. arXiv cs.AI TIER_1 English(EN) · Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang ·

    ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    arXiv:2604.08168v2 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning …

  165. arXiv cs.AI TIER_1 English(EN) · Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan ·

    Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.12655v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewar…

  166. arXiv cs.LG TIER_1 English(EN) · Wenpu Liu, Yuqi Xu, Weichu Xie, Yongfu Zhu, Shuai Dong, Ziyue Wang, Wenqi Shao, Xiaoying Zhang, Tong Yang, Nan Duan, Jiaqi Wang ·

    Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

    arXiv:2605.17333v2 Announce Type: replace Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the…

  167. arXiv cs.LG TIER_1 English(EN) · Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li ·

    SHAP-Guided Kernel Actor-Critic for Explainable Reinforcement Learning

    arXiv:2512.05291v3 Announce Type: replace Abstract: Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state feature…

  168. arXiv cs.AI TIER_1 English(EN) · Fabio Patrizi ·

    Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

    In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to…

  169. arXiv cs.LG TIER_1 English(EN) · Walid Saad ·

    QnRL: Quantum-Native Reinforcement Learning

    Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures indirectly ap…

  170. arXiv cs.AI TIER_1 English(EN) · Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel ·

    Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

    arXiv:2503.01734v3 Announce Type: replace-cross Abstract: Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generat…

  171. arXiv cs.LG TIER_1 English(EN) · Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang ·

    Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

    arXiv:2512.09706v2 Announce Type: replace Abstract: The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces, such as exclusively using APIs, GUI eve…

  172. arXiv cs.LG TIER_1 English(EN) · Boyang Xu, Qing Zou, Siqin Yang, Hao Yan ·

    Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    arXiv:2605.08253v2 Announce Type: replace Abstract: Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismat…

  173. arXiv cs.LG TIER_1 English(EN) · Ali Saheb Pasand, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro ·

    Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

    arXiv:2602.19373v3 Announce Type: replace Abstract: Deep reinforcement learning systems often suffer from unstable training dynamics due to non-stationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Ga…

  174. arXiv cs.LG TIER_1 English(EN) · Elizabeth Bates, Chris Hicks, Vasilios Mavroudis ·

    Beyond Rewards in Reinforcement Learning for Cyber Defence

    arXiv:2602.04809v3 Announce Type: replace Abstract: Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, …

  175. arXiv cs.LG TIER_1 English(EN) · Giorgio Maria Cavallazzi, Miguel Pérez-Cuadrado, Alfredo Pinelli ·

    Drag reduction or reward hacking? Recurrent multi-agent reinforcement learning that earns its reward

    arXiv:2606.06227v1 Announce Type: cross Abstract: A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-…

  176. arXiv cs.LG TIER_1 English(EN) · Nguyen Cong Luong, Shaohan Feng, Nguyen Duc Hai, Zeping Sui, Bo Ma, Min Xu, Zhihao Dong, Qiushi Zhao, Nguyen Duc Duy Anh, Nguyen Quoc Khanh, Ngoc Hung Nguyen, Zitian Zhang, Jie Cao ·

    Transformer-Enhanced Reinforcement Learning: Fundamentals and Applications in Communication Networks

    arXiv:2606.05208v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has long been a powerful solution to various problems in communication networks. However, traditional RL models still face several limitations. Not only do they rely on large numbers of interaction…

  177. arXiv cs.LG TIER_1 English(EN) · Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang ·

    Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

    arXiv:2606.06053v1 Announce Type: new Abstract: We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecif…

  178. arXiv cs.LG TIER_1 English(EN) · Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen ·

    When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

    arXiv:2606.05885v1 Announce Type: new Abstract: Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level …

  179. arXiv cs.LG TIER_1 English(EN) · Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro ·

    Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

    arXiv:2606.05555v1 Announce Type: new Abstract: Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it uncle…

  180. arXiv cs.LG TIER_1 English(EN) · Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya ·

    Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

    arXiv:2606.05434v1 Announce Type: new Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We i…

  181. arXiv cs.LG TIER_1 English(EN) · Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross ·

    Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

    arXiv:2606.05296v1 Announce Type: new Abstract: LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-…

  182. arXiv cs.LG TIER_1 English(EN) · Renwei Meng ·

    Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

    arXiv:2606.05263v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing proc…

  183. Hugging Face Daily Papers TIER_1 English(EN) ·

    StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.

  184. arXiv cs.LG TIER_1 English(EN) · Alfredo Pinelli ·

    Drag reduction or reward hacking? Recurrent multi-agent reinforcement learning that earns its reward

    A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-conservation projection couples agents' outputs an…

  185. arXiv cs.LG TIER_1 English(EN) · Huazheng Wang ·

    Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

    We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fa…

  186. arXiv cs.AI TIER_1 English(EN) · Saket Tiwari, Tejas Kotwal, George Konidaris ·

    From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

    arXiv:2606.04275v1 Announce Type: cross Abstract: We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on pre…

  187. arXiv cs.AI TIER_1 English(EN) · Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang ·

    RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

    arXiv:2606.04051v1 Announce Type: cross Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or st…

  188. arXiv cs.AI TIER_1 English(EN) · Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad ·

    Reinforcement Learning from Rich Feedback with Distributional DAgger

    arXiv:2606.05152v1 Announce Type: cross Abstract: Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the fina…

  189. arXiv cs.AI TIER_1 English(EN) · Mohit Prashant, Arvind Easwaran ·

    Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

    arXiv:2606.04812v1 Announce Type: cross Abstract: Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unkn…

  190. arXiv cs.AI TIER_1 English(EN) · Viktor Veselý, Aleksandar Todorov, Erwan Escudie, Matthia Sabatelli ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    arXiv:2606.04735v1 Announce Type: cross Abstract: Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement lea…

  191. arXiv cs.LG TIER_1 English(EN) · Jiayi Wang, Zhengling Qi, Chengchun Shi ·

    Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments

    arXiv:2209.15448v3 Announce Type: replace Abstract: As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce th…

  192. arXiv cs.LG TIER_1 English(EN) · Guopeng Li, Moritz A. Zanger, Matthijs T. J. Spaan, Julian F. P. Kooij ·

    COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection

    arXiv:2606.04749v1 Announce Type: cross Abstract: Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled i…

  193. arXiv cs.LG TIER_1 English(EN) · Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan ·

    Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

    arXiv:2606.05021v1 Announce Type: new Abstract: We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each a…

  194. arXiv cs.LG TIER_1 English(EN) · Sabine Rieder, Stefan Pranger, Debraj Chakraborty, Jan Křetínský, Bettina Könighofer ·

    Explainably Safe Reinforcement Learning

    arXiv:2606.04634v1 Announce Type: new Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque.…

  195. arXiv cs.AI TIER_1 English(EN) · Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    arXiv:2606.04484v1 Announce Type: new Abstract: We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a dec…

  196. arXiv cs.AI TIER_1 English(EN) · Ajay Vishwanath, Christian Omlin ·

    Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

    arXiv:2606.04750v1 Announce Type: new Abstract: Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to in…

  197. arXiv cs.LG TIER_1 English(EN) · Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu ·

    Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2606.04492v1 Announce Type: new Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often…

  198. arXiv cs.CL TIER_1 English(EN) · Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan ·

    Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

    arXiv:2507.21892v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as…

  199. arXiv cs.CL TIER_1 English(EN) · Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu ·

    GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

    arXiv:2606.05002v1 Announce Type: new Abstract: LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. M…

  200. arXiv cs.AI TIER_1 English(EN) · Parnian Behdin, Kevin Roice, Golnaz Mesbahi ·

    Position: Deployed Reinforcement Learning should be Continual

    arXiv:2606.04029v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until p…

  201. arXiv cs.CL TIER_1 English(EN) · Tej Deep Pala, Vernon Toh, Soujanya Poria ·

    GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

    arXiv:2606.04889v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens,…

  202. arXiv cs.AI TIER_1 English(EN) · Melvin Laux, Yi-Ling Liu, Rina Alo, Sören Töpper, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam ·

    Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

    arXiv:2604.12645v2 Announce Type: replace-cross Abstract: Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary u…

  203. arXiv cs.AI TIER_1 English(EN) · Jiashu Yao, Heyan Huang, Daiqing Wu, Zeming Liu, Yuhang Guo ·

    Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

    arXiv:2604.11510v2 Announce Type: replace-cross Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entr…

  204. Hugging Face Daily Papers TIER_1 English(EN) ·

    Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

    Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalabilit…

  205. arXiv cs.LG TIER_1 English(EN) · Paria Rashidinejad ·

    Reinforcement Learning from Rich Feedback with Distributional DAgger

    Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide ric…

  206. arXiv cs.LG TIER_1 English(EN) · Hamza Khan ·

    Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

    We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, …

  207. arXiv cs.CL TIER_1 English(EN) · Zhiyuan Liu ·

    GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

    LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise t…

  208. arXiv cs.CL TIER_1 English(EN) · Soujanya Poria ·

    GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for …

  209. Hugging Face Daily Papers TIER_1 English(EN) ·

    GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for …

  210. arXiv cs.LG TIER_1 English(EN) · Arvind Easwaran ·

    Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

    Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verifi…

  211. arXiv cs.LG TIER_1 English(EN) · Christian Omlin ·

    Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

    Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully d…

  212. arXiv cs.LG TIER_1 English(EN) · Julian F. P. Kooij ·

    COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection

    Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wi…

  213. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB).…

  214. arXiv cs.LG TIER_1 English(EN) · Matthia Sabatelli ·

    Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

    Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB).…

  215. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Bolin Ding ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm se…

  216. Hugging Face Daily Papers TIER_1 English(EN) ·

    AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm se…

  217. arXiv cs.AI TIER_1 English(EN) · Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng, Zeyi Wen ·

    Efficient Hyperparameter Optimization for LLM Reinforcement Learning

    arXiv:2606.03073v1 Announce Type: cross Abstract: Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO metho…

  218. arXiv cs.AI TIER_1 English(EN) · Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye ·

    EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

    arXiv:2606.03108v1 Announce Type: new Abstract: Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introdu…

  219. arXiv cs.AI TIER_1 English(EN) · Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Bo Li, Yuan Cheng, Yuan Qi, Yaodong Yang ·

    Finding Kissing Numbers with Game-theoretic Reinforcement Learning

    arXiv:2511.13391v4 Announce Type: replace-cross Abstract: Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a defining challenge in discrete geometry. As the local an…

  220. arXiv cs.AI TIER_1 English(EN) · Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia ·

    Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

    arXiv:2511.02304v2 Announce Type: replace-cross Abstract: We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables br…

  221. arXiv cs.AI TIER_1 English(EN) · Matteo Gallici, Ivan Masmitja, Mario Martín ·

    Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

    arXiv:2505.08222v3 Announce Type: replace-cross Abstract: Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essent…

  222. arXiv cs.AI TIER_1 English(EN) · Leonard Hinckeldey, Elliot Fosong, Rimvydas Rubavicius, Elle Miller, Trevor McInroe, Fan Zhang, Patricia Wollstadt, Stefano V. Albrecht, Subramanian Ramamoorthy ·

    Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

    arXiv:2507.21638v2 Announce Type: replace Abstract: The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run a…

  223. arXiv cs.AI TIER_1 English(EN) · Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou ·

    Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

    arXiv:2606.03963v1 Announce Type: cross Abstract: Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fin…

  224. arXiv cs.AI TIER_1 English(EN) · Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland ·

    Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

    arXiv:2606.03962v1 Announce Type: cross Abstract: Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity.…

  225. arXiv cs.AI TIER_1 English(EN) · Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi ·

    Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

    arXiv:2606.03892v1 Announce Type: cross Abstract: Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual stat…

  226. arXiv cs.AI TIER_1 English(EN) · Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao ·

    Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

    arXiv:2606.03762v1 Announce Type: cross Abstract: Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-relian…

  227. arXiv cs.AI TIER_1 English(EN) · Siemen Herremans, Ali Anwar, Siegfried Mercelis ·

    Post-Hoc Robustness for Model-Based Reinforcement Learning

    arXiv:2606.03521v1 Announce Type: cross Abstract: To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a…

  228. arXiv cs.AI TIER_1 English(EN) · Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan ·

    ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

    arXiv:2606.03070v1 Announce Type: cross Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected meth…

  229. arXiv cs.LG TIER_1 English(EN) · Stefan Pranger, Bettina Könighofer ·

    Easy-to-Use Shielding for Reinforcement Learning

    arXiv:2606.03804v1 Announce Type: new Abstract: Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. …

  230. arXiv cs.LG TIER_1 English(EN) · Can Lv, Mingju Chen, Heng Chang, Shiji Zhou ·

    Mitigating False Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning

    arXiv:2606.03361v1 Announce Type: new Abstract: Rubric-based rewards are increasingly used for open-ended language model post-training, but criterion-level scores are often aggregated as independent utilities. This flat scalarization ignores rubric-specified prerequisite and acti…

  231. arXiv cs.CL TIER_1 English(EN) · Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng ·

    Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

    arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We refram…

  232. Hugging Face Daily Papers TIER_1 English(EN) ·

    Reinforcement Learning from Rich Feedback with Distributional DAgger

    Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods.

  233. Hugging Face Daily Papers TIER_1 English(EN) ·

    GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

    Gradient-Reweighted Advantage (GRAIL) improves mathematical reasoning in LLMs by reweighting token-wise advantages based on gradient-activation saliency, outperforming GRPO in accuracy and Pass@3 metrics.

  234. arXiv cs.AI TIER_1 English(EN) · Dzmitry Tsetserukou ·

    Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

    Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not gua…

  235. arXiv cs.AI TIER_1 English(EN) · Mark Rowland ·

    Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

    Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization …

  236. Hugging Face Daily Papers TIER_1 English(EN) ·

    Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

    Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization …

  237. arXiv cs.AI TIER_1 English(EN) · Pavan Kapanipathi ·

    Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

    Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), a…

  238. arXiv cs.LG TIER_1 English(EN) · Bettina Könighofer ·

    Easy-to-Use Shielding for Reinforcement Learning

    Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. …

  239. arXiv cs.AI TIER_1 English(EN) · Yang Gao ·

    Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

    Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, w…

  240. arXiv cs.LG TIER_1 English(EN) · Siegfried Mercelis ·

    Post-Hoc Robustness for Model-Based Reinforcement Learning

    To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an …

  241. Hugging Face Daily Papers TIER_1 English(EN) ·

    Post-Hoc Robustness for Model-Based Reinforcement Learning

    To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an …

  242. arXiv cs.LG TIER_1 English(EN) · Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters ·

    Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

    arXiv:2606.02194v1 Announce Type: new Abstract: Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL)…

  243. arXiv cs.LG TIER_1 English(EN) · Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang ·

    FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

    arXiv:2510.09222v3 Announce Type: replace Abstract: Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness…

  244. arXiv cs.LG TIER_1 English(EN) · Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang ·

    MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

    arXiv:2312.03644v3 Announce Type: replace Abstract: Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to i…

  245. arXiv cs.LG TIER_1 English(EN) · Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra ·

    RDA: Reward Design Agent for Reinforcement Learning

    arXiv:2606.01672v1 Announce Type: new Abstract: Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka,…

  246. arXiv cs.LG TIER_1 English(EN) · Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe ·

    All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

    arXiv:2606.01363v1 Announce Type: new Abstract: Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccur…

  247. arXiv cs.LG TIER_1 English(EN) · Hikmet Simsir, Ozgur S. Oguz ·

    Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

    arXiv:2606.01151v1 Announce Type: new Abstract: Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance,…

  248. arXiv cs.LG TIER_1 English(EN) · Shao-An Yin ·

    Distributed GNEP Algorithms without Multiplier Sharing and Applications to Multi-Robot Coordination and Contextual Bandit-Based Active Learning

    arXiv:2606.00759v1 Announce Type: new Abstract: Recent advances in artificial intelligence have expanded the focus from classical optimization to include equilibrium analysis in noncooperative games. Many such games involve shared constraints, leading to Generalized Nash Equilibr…

  249. arXiv cs.CL TIER_1 English(EN) · Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen ·

    StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    arXiv:2604.18401v2 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where toke…

  250. arXiv cs.CL TIER_1 English(EN) · Víctor Gallego ·

    Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

    arXiv:2603.19453v2 Announce Type: replace Abstract: We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LL…

  251. arXiv cs.CL TIER_1 English(EN) · Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin ·

    BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

    arXiv:2602.03719v2 Announce Type: replace Abstract: Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly a…

  252. arXiv cs.CL TIER_1 English(EN) · Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Jie Ouyang, Yucong Luo, Yitong Zhou, Qi Liu, Enhong Chen ·

    Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

    arXiv:2511.14460v2 Announce Type: replace Abstract: Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon t…

  253. arXiv cs.CL TIER_1 English(EN) · Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang ·

    Deep Research as Rubric for Reinforcement Learning

    arXiv:2606.01091v1 Announce Type: new Abstract: Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts -- e…

  254. arXiv cs.CL TIER_1 English(EN) · Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang ·

    Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

    arXiv:2606.00755v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learnin…

  255. arXiv cs.AI TIER_1 English(EN) · Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang ·

    Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulat…

  256. arXiv cs.AI TIER_1 English(EN) · Dogan Urgun, Gokhan Gungor ·

    Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2603.24324v4 Announce Type: replace-cross Abstract: Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient groundi…

  257. arXiv cs.AI TIER_1 English(EN) · Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng ·

    HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

    arXiv:2603.03741v2 Announce Type: replace-cross Abstract: To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inh…

  258. arXiv cs.AI TIER_1 English(EN) · Sam Dauncey, Roger Wattenhofer ·

    You Can Learn Tokenization End-to-End with Reinforcement Learning

    arXiv:2602.13940v2 Announce Type: replace-cross Abstract: Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown prom…

  259. arXiv cs.AI TIER_1 English(EN) · Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker ·

    Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

    arXiv:2602.02098v2 Announce Type: replace-cross Abstract: Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are i…

  260. arXiv cs.AI TIER_1 English(EN) · Hongyu Lin, Yuchen Li, Haoran Luo, Zhenghong Lin, Libo Zhang, Mingjie Xing, Yanjun Wu ·

    TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

    arXiv:2508.12551v2 Announce Type: replace-cross Abstract: Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space, sparse performance feedback, and strong workload sensitivity. We present TuneAgent…

  261. arXiv cs.AI TIER_1 English(EN) · Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang ·

    MARFT: Multi-Agent Reinforcement Fine-Tuning

    arXiv:2504.16129v5 Announce Type: replace-cross Abstract: Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to s…

  262. arXiv cs.AI TIER_1 English(EN) · Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han ·

    LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.18077v2 Announce Type: replace Abstract: Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state informa…

  263. arXiv cs.AI TIER_1 English(EN) · Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng ·

    On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

    arXiv:2603.12109v2 Announce Type: replace Abstract: Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task horizons. However, in active reasoning where agents must elicit new observations through …

  264. arXiv cs.AI TIER_1 English(EN) · Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai ·

    MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

    arXiv:2601.22900v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed sample…

  265. arXiv cs.AI TIER_1 English(EN) · Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao ·

    OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

    arXiv:2606.02031v1 Announce Type: cross Abstract: Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open a…

  266. arXiv cs.AI TIER_1 English(EN) · Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma ·

    Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

    arXiv:2606.01098v1 Announce Type: cross Abstract: Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, the…

  267. arXiv cs.AI TIER_1 English(EN) · Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu ·

    Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

    arXiv:2606.00780v1 Announce Type: cross Abstract: Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and poli…

  268. arXiv cs.AI TIER_1 English(EN) · Rui Zhang, Xinle Wu, Yao Lu ·

    CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

    arXiv:2606.00609v1 Announce Type: cross Abstract: Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capabilit…

  269. arXiv cs.AI TIER_1 English(EN) · Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu, Huanwei Di, Jiang Liu, Jialian Wu, Zhengzhong Liu, Zicheng Liu, Emad Barsoum, Dimitris N. Metaxas, Hongyi Wang ·

    PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

    arXiv:2606.00395v1 Announce Type: cross Abstract: Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert …

  270. arXiv cs.AI TIER_1 English(EN) · Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy ·

    Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

    arXiv:2606.00367v1 Announce Type: cross Abstract: Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that…

  271. arXiv cs.AI TIER_1 English(EN) · Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo ·

    Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

    arXiv:2606.00151v1 Announce Type: cross Abstract: In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy i…

  272. arXiv cs.AI TIER_1 English(EN) · Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    arXiv:2606.02373v1 Announce Type: new Abstract: Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actual…

  273. arXiv cs.AI TIER_1 English(EN) · Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai ·

    SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

    arXiv:2606.02355v1 Announce Type: new Abstract: Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, contex…

  274. arXiv cs.AI TIER_1 English(EN) · Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson ·

    Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

    arXiv:2606.02337v1 Announce Type: new Abstract: Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure a…

  275. arXiv cs.AI TIER_1 English(EN) · Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang ·

    Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

    arXiv:2606.02132v1 Announce Type: new Abstract: Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which…

  276. arXiv cs.AI TIER_1 English(EN) · Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal ·

    Certificate-Guided Evaluation of Reinforcement Learning Generalization

    arXiv:2606.00840v1 Announce Type: new Abstract: This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, charact…

  277. arXiv cs.AI TIER_1 English(EN) · Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli, Francesco Fabiano ·

    Robust Shielding for Safe Reinforcement Learning

    arXiv:2606.00270v1 Announce Type: new Abstract: Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant trans…

  278. Hugging Face Daily Papers TIER_1 English(EN) ·

    EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

    EvoTrainer autonomously evolves both language model policies and training harnesses through empirical feedback, demonstrating superior performance in complex reasoning and coding tasks compared to traditional handcrafted approaches.

  279. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Jiawei Han ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation …

  280. arXiv cs.AI TIER_1 English(EN) · Xunliang Cai ·

    SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

    Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Sel…

  281. arXiv cs.AI TIER_1 English(EN) · Anders Jonsson ·

    Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

    Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination…

  282. arXiv cs.LG TIER_1 English(EN) · Jan Peters ·

    Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

    Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these polici…

  283. arXiv cs.AI TIER_1 English(EN) · Liang Wang ·

    Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

    Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress use…

  284. Hugging Face Daily Papers TIER_1 English(EN) ·

    OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

    Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-trai…

  285. arXiv cs.CL TIER_1 English(EN) · Jianfeng Gao ·

    OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

    Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-trai…

  286. arXiv cs.LG TIER_1 English(EN) · Baptiste Debes, Tinne Tuytelaars ·

    Multivariate Distributional Reinforcement Learning Using Sliced Divergences

    arXiv:2605.31222v1 Announce Type: new Abstract: Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dime…

  287. arXiv cs.LG TIER_1 English(EN) · Faiq Shamass ·

    ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

    arXiv:2605.30612v1 Announce Type: cross Abstract: Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but in…

  288. arXiv cs.LG TIER_1 English(EN) · Giseung Park, Hyunyoung Nam, Woohyeon Byeon, Amir Leshem, Youngchul Sung ·

    Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion

    arXiv:2605.31388v1 Announce Type: new Abstract: Multi-Objective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its ap…

  289. arXiv cs.LG TIER_1 English(EN) · Mateusz Odrowaz-Sypniewski, Jasmine Bayrooti, Ajay Shankar, Amanda Prorok ·

    Generalized Intention Modeling in Multi-Agent Reinforcement Learning

    arXiv:2605.31318v1 Announce Type: new Abstract: Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived…

  290. arXiv cs.LG TIER_1 English(EN) · Franki Nguimatsia-Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier ·

    Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

    arXiv:2605.31273v1 Announce Type: new Abstract: While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due t…

  291. arXiv cs.LG TIER_1 English(EN) · Tobias Lademann, Théo Vincent, Jan Peters, Matthias Weigold ·

    The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

    arXiv:2605.31044v1 Announce Type: new Abstract: Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of dep…

  292. arXiv cs.LG TIER_1 English(EN) · Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald ·

    Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

    arXiv:2605.30896v1 Announce Type: new Abstract: Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited…

  293. arXiv cs.LG TIER_1 English(EN) · Enoch Hyunwook Kang ·

    A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models

    arXiv:2605.30843v1 Announce Type: new Abstract: In the forward reinforcement-learning problem, the reward is fixed and known; the learner is asked to find a good policy or value function. Here we turn the question around. Given offline data generated by an expert, can we recover …

  294. arXiv cs.LG TIER_1 English(EN) · Ha Manh Bui, Metod Jazbec, Eric Nalisnick, Anqi Liu ·

    Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

    arXiv:2605.30776v1 Announce Type: new Abstract: Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Ex…

  295. arXiv cs.CL TIER_1 English(EN) · Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai ·

    Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

    arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setti…

  296. arXiv cs.CL TIER_1 English(EN) · Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng ·

    The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

    arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the pol…

  297. arXiv cs.AI TIER_1 English(EN) · Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han ·

    Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

    arXiv:2605.18024v2 Announce Type: replace-cross Abstract: Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered…

  298. arXiv cs.AI TIER_1 English(EN) · Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier ·

    SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

    arXiv:2604.17551v2 Announce Type: replace-cross Abstract: Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and su…

  299. arXiv cs.AI TIER_1 English(EN) · Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik ·

    REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

    arXiv:2603.17145v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typicall…

  300. arXiv cs.AI TIER_1 English(EN) · Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong ·

    HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

    arXiv:2602.16165v2 Announce Type: replace-cross Abstract: Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before rec…

  301. arXiv cs.AI TIER_1 English(EN) · Tomas Leroy-Stone ·

    Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

    arXiv:2605.31361v1 Announce Type: cross Abstract: In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong general…

  302. arXiv cs.AI TIER_1 English(EN) · Amir Esterhuysen, Anders Jonsson ·

    The Terminal Representation in Reinforcement Learning

    arXiv:2605.31289v1 Announce Type: cross Abstract: Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The …

  303. arXiv cs.AI TIER_1 English(EN) · Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma ·

    EchoRL: Reinforcement Learning via Rollout Echoing

    arXiv:2605.31228v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, which makes the…

  304. arXiv cs.AI TIER_1 English(EN) · Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang ·

    Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

    arXiv:2605.30903v1 Announce Type: cross Abstract: Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study r…

  305. arXiv cs.AI TIER_1 English(EN) · Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui ·

    DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

    arXiv:2605.30859v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long ta…

  306. arXiv cs.AI TIER_1 English(EN) · Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson ·

    Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

    arXiv:2605.30461v1 Announce Type: cross Abstract: We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have…

  307. arXiv cs.AI TIER_1 English(EN) · Rafael Bankosegger, Thomas Eiter, Johannes Oetsch ·

    Answer-Set-Programming-based Abstractions for Reinforcement Learning

    arXiv:2605.31444v1 Announce Type: new Abstract: Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are t…

  308. arXiv cs.AI TIER_1 English(EN) · Mustafa Anis Hussain, Xinle Wu, Yao Lu ·

    Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

    arXiv:2605.30824v1 Announce Type: new Abstract: Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or…

  309. arXiv cs.AI TIER_1 English(EN) · Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner ·

    Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

    arXiv:2605.30576v1 Announce Type: new Abstract: Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framewor…

  310. Hugging Face Daily Papers TIER_1 English(EN) ·

    OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

    OpenWebRL presents a framework for training visual web agents using online reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision.

  311. Hugging Face Daily Papers TIER_1 English(EN) ·

    Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

    A 20B search agent trained with reinforcement learning within a stateful search framework demonstrates superior retrieval performance across multiple domains by separating semantic decision-making from environmental bookkeeping.

  312. arXiv cs.AI TIER_1 English(EN) · Johannes Oetsch ·

    Answer-Set-Programming-based Abstractions for Reinforcement Learning

    Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Lea…

  313. arXiv cs.LG TIER_1 English(EN) · Youngchul Sung ·

    Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion

    Multi-Objective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its applicability remains limited, particularly when c…

  314. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tomas Leroy-Stone ·

    Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

    In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent sett…

  315. arXiv cs.CL TIER_1 English(EN) · Florian Mai ·

    Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

    Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcem…

  316. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Amanda Prorok ·

    Generalized Intention Modeling in Multi-Agent Reinforcement Learning

    Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived from episode information chosen a priori, such …

  317. arXiv cs.AI TIER_1 English(EN) · Anders Jonsson ·

    The Terminal Representation in Reinforcement Learning

    Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they …

  318. arXiv cs.LG TIER_1 English(EN) · Justin Carpentier ·

    Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

    While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma inherent in c…

  319. arXiv cs.AI TIER_1 English(EN) · Yunpu Ma ·

    EchoRL: Reinforcement Learning via Rollout Echoing

    Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making the training gain marginal and ineffective. Sp…

  320. arXiv cs.LG TIER_1 English(EN) · Tinne Tuytelaars ·

    Multivariate Distributional Reinforcement Learning Using Sliced Divergences

    Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and th…

  321. arXiv cs.LG TIER_1 English(EN) · Yifu Zheng ·

    RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

    arXiv:2605.30154v1 Announce Type: new Abstract: Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite …

  322. arXiv cs.LG TIER_1 English(EN) · Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang ·

    Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition

    arXiv:2505.05968v3 Announce Type: replace Abstract: Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out-of-distributio…

  323. arXiv cs.LG TIER_1 English(EN) · Feiyang Wu, Ye Zhao, Anqi Wu ·

    Distributional Inverse Reinforcement Learning

    arXiv:2510.03013v4 Announce Type: replace Abstract: We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a de…

  324. arXiv cs.LG TIER_1 English(EN) · Yuehu Gong, Zeyuan Wang, Yulin Chen, Shutong Ding, Qingyuan Zhou, Yanwei Fu ·

    Path-Space Mirror Descent for On-Policy Reinforcement Learning under the Generalized Schrödinger Bridge

    arXiv:2603.21621v2 Announce Type: replace Abstract: Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose ex…

  325. arXiv cs.AI TIER_1 English(EN) · James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz ·

    On Distributional Reinforcement Learning in Chaotic Dynamical Systems

    arXiv:2605.30160v1 Announce Type: cross Abstract: Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamic…

  326. arXiv cs.AI TIER_1 English(EN) · Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang ·

    HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

    arXiv:2605.30201v1 Announce Type: cross Abstract: We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, w…

  327. arXiv cs.AI TIER_1 English(EN) · Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian ·

    BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

    arXiv:2605.30226v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to…

  328. arXiv cs.AI TIER_1 English(EN) · Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu ·

    Reinforcement Learning with Robust Rubric Rewards

    arXiv:2605.30244v1 Announce Type: cross Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, r…

  329. arXiv cs.AI TIER_1 English(EN) · Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen ·

    Offline Reinforcement Learning with Generative Trajectory Policies

    arXiv:2510.11499v2 Announce Type: replace-cross Abstract: Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow,…

  330. arXiv cs.AI TIER_1 English(EN) · Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng ·

    Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

    arXiv:2602.01058v2 Announce Type: replace-cross Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT pe…

  331. arXiv cs.CL TIER_1 English(EN) · Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du ·

    PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

    arXiv:2605.29582v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across mu…

  332. arXiv cs.CL TIER_1 English(EN) · Andy Q Han, David J. Chalmers, Pavel Izmailov ·

    How's it going? Reinforcement learning in language models recruits a functional welfare axis

    arXiv:2605.30232v1 Announce Type: cross Abstract: How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, rel…

  333. arXiv cs.LG TIER_1 English(EN) · Keru Chen ·

    Information-Directed Offline-to-Online Reinforcement Learning

    arXiv:2605.29405v1 Announce Type: new Abstract: Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for e…

  334. arXiv cs.LG TIER_1 English(EN) · Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi ·

    Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

    arXiv:2605.30056v1 Announce Type: cross Abstract: Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the sampli…

  335. arXiv cs.AI TIER_1 English(EN) · Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    arXiv:2605.29796v1 Announce Type: new Abstract: Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize…

  336. arXiv cs.AI TIER_1 English(EN) · Matt Gorbett, Hossein Shirazi ·

    Label-Free Reinforcement Learning via Cross-Model Entropy

    arXiv:2605.29009v1 Announce Type: cross Abstract: Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness c…

  337. arXiv cs.AI TIER_1 English(EN) · Aalok Patwa ·

    Self-Play Reinforcement Learning under Imperfect Information in Big 2

    arXiv:2605.28863v1 Announce Type: cross Abstract: Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We deve…

  338. arXiv cs.AI TIER_1 English(EN) · Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma ·

    Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

    arXiv:2605.28829v1 Announce Type: cross Abstract: Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models pe…

  339. arXiv cs.AI TIER_1 English(EN) · Geoffrey Bradway, Roger Creus Castanyer, Lorenz Wolf, Maxwill Lin, Matthew James Sargent, Augustine N. Mavor-Parker ·

    unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

    arXiv:2605.29115v1 Announce Type: cross Abstract: Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Pytho…

  340. arXiv cs.AI TIER_1 English(EN) · Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng ·

    Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

    arXiv:2605.29782v1 Announce Type: cross Abstract: Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an un…

  341. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

    SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives.

  342. arXiv cs.AI TIER_1 English(EN) · Dandan Tu ·

    Reinforcement Learning with Robust Rubric Rewards

    While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide …

  343. arXiv cs.CL TIER_1 English(EN) · Pavel Izmailov ·

    How's it going? Reinforcement learning in language models recruits a functional welfare axis

    How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language mode…

  344. arXiv cs.AI TIER_1 English(EN) · Wenzhao Lian ·

    BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

    Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding exe…

  345. arXiv cs.AI TIER_1 English(EN) · Haozhe Zhang ·

    HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

    We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the …

  346. arXiv cs.AI TIER_1 English(EN) · María Pérez-Ortiz ·

    On Distributional Reinforcement Learning in Chaotic Dynamical Systems

    Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains,…

  347. arXiv cs.LG TIER_1 English(EN) · Yifu Zheng ·

    RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

    Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper d…

  348. arXiv cs.LG TIER_1 English(EN) · Ye Shi ·

    Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

    Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the sampling-based policy optimization. This design enables …

  349. arXiv cs.CL TIER_1 English(EN) · Jinsong Su ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly trigger…

  350. arXiv cs.CL TIER_1 English(EN) · James Cheng ·

    Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

    Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In thi…

  351. arXiv cs.LG TIER_1 English(EN) · Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz ·

    Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

    arXiv:2601.15015v2 Announce Type: replace Abstract: Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical s…

  352. arXiv cs.AI TIER_1 English(EN) · Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang ·

    TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

    arXiv:2605.28699v1 Announce Type: new Abstract: Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to mu…

  353. arXiv cs.AI TIER_1 English(EN) · Yiran Pang, Zhen Ni, Xiangnan Zhong ·

    Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

    arXiv:2605.27385v1 Announce Type: cross Abstract: Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneo…

  354. arXiv cs.AI TIER_1 English(EN) · Gengyue Han, Yiheng Feng ·

    Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

    arXiv:2605.27659v1 Announce Type: cross Abstract: Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world environ…

  355. arXiv cs.AI TIER_1 English(EN) · Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang ·

    ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

    arXiv:2605.28293v1 Announce Type: cross Abstract: Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such seque…

  356. arXiv cs.AI TIER_1 English(EN) · Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang ·

    OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    arXiv:2604.18530v2 Announce Type: replace Abstract: Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy d…

  357. arXiv cs.AI TIER_1 English(EN) · Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo ·

    ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

    arXiv:2602.02150v2 Announce Type: replace-cross Abstract: Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior …

  358. arXiv cs.AI TIER_1 English(EN) · Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang ·

    Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

    arXiv:2605.19444v2 Announce Type: replace-cross Abstract: Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most ref…

  359. arXiv cs.CL TIER_1 English(EN) · Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian ·

    Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

    arXiv:2605.28424v1 Announce Type: new Abstract: Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer …

  360. arXiv cs.CL TIER_1 English(EN) · Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis ·

    Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

    arXiv:2605.28561v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable:…

  361. arXiv cs.CL TIER_1 English(EN) · Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li ·

    Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

    arXiv:2602.05897v2 Announce Type: replace Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness halluc…

  362. arXiv cs.CL TIER_1 English(EN) · Siqi Guo, Ming Lin, Tianbao Yang ·

    DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation

    arXiv:2603.21465v2 Announce Type: replace Abstract: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA ker…

  363. arXiv cs.LG TIER_1 English(EN) · Wendi Li, Shawn Im, Sharon Li ·

    Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

    arXiv:2605.27954v1 Announce Type: new Abstract: Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving the…

  364. arXiv cs.LG TIER_1 English(EN) · Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu ·

    Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.28127v1 Announce Type: new Abstract: Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state--goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarc…

  365. arXiv cs.LG TIER_1 English(EN) · Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin ·

    Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

    arXiv:2605.28184v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraini…

  366. arXiv cs.LG TIER_1 English(EN) · Onno Eberhard, Claire Vernade, Michael Muehlebach ·

    Commit to the Bit: Reactive Reinforcement Learning Done Right

    arXiv:2605.28276v1 Announce Type: new Abstract: Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation…

  367. arXiv cs.LG TIER_1 English(EN) · Mingjie Hu, Jian-Qiang Hu, Enlu Zhou ·

    Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

    arXiv:2605.28675v1 Announce Type: new Abstract: Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified l…

  368. arXiv cs.LG TIER_1 English(EN) · Renye Yan, Yaozhong Gan, You Wu, Junliang Xing, Ling Liang, Yeshang Zhu, Yimao Cai ·

    AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

    arXiv:2410.04498v2 Announce Type: replace Abstract: In sparse reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences like humans. However, current memory-based RL methods simply s…

  369. arXiv cs.LG TIER_1 English(EN) · Amir Moeini, Minjae Kwon, Alper Kamil Bozkurt, Yuichi Motai, Rohan Chandra, Lu Feng, Shangtong Zhang ·

    Safe In-Context Reinforcement Learning

    arXiv:2509.25582v3 Announce Type: replace Abstract: In-context reinforcement learning (ICRL) is an emerging RL paradigm where an agent, after pretraining, can adapt to out-of-distribution test tasks without any parameter updates, instead relying on an expanding context of interac…

  370. arXiv cs.LG TIER_1 English(EN) · Xinyu Liu, Zixuan Xie, Shangtong Zhang ·

    Extensions of Robbins-Siegmund Theorem with Applications in Reinforcement Learning

    arXiv:2509.26442v2 Announce Type: replace Abstract: The Robbins-Siegmund theorem establishes the convergence of stochastic processes that are almost supermartingales and is one of the most commonly used approaches for analyzing stochastic iterative algorithms in stochastic approx…

  371. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Hui Xiong ·

    LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning

    Effective training-time guidance is central to multi-agent reinforcement learning (MARL), yet remains difficult in sparse-reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual d…

  372. Hugging Face Daily Papers TIER_1 English(EN) ·

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

    SAAS introduces a reinforcement learning framework that enhances agent self-awareness to reduce unnecessary searches in LLM-based question answering systems.

  373. arXiv cs.AI TIER_1 English(EN) · Wentao Zhang ·

    TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

    Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces following dil…

  374. arXiv cs.LG TIER_1 English(EN) · Enlu Zhou ·

    Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

    Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition i…

  375. arXiv cs.CL TIER_1 English(EN) · Beyza Ermis ·

    Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

    Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, response…

  376. arXiv cs.CL TIER_1 English(EN) · Weining Qian ·

    Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

    Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. …

  377. Hugging Face Daily Papers TIER_1 English(EN) ·

    Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

    Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance…

  378. Hugging Face Daily Papers TIER_1 English(EN) ·

    Commit to the Bit: Reactive Reinforcement Learning Done Right

    Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access non-Markovia…

  379. Hugging Face Daily Papers TIER_1 English(EN) ·

    Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

    Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state--goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introd…

  380. arXiv cs.AI TIER_1 English(EN) · Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, … ·

    Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    arXiv:2509.21882v3 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet wel…

  381. arXiv cs.AI TIER_1 English(EN) · Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee ·

    Rethinking the Trust Region in LLM Reinforcement Learning

    arXiv:2602.04879v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the…

  382. arXiv cs.AI TIER_1 English(EN) · Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    arXiv:2605.18592v2 Announce Type: replace-cross Abstract: Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such a…

  383. arXiv cs.CL TIER_1 English(EN) · Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang ·

    Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

    arXiv:2605.26952v1 Announce Type: new Abstract: Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's …

  384. arXiv cs.CL TIER_1 English(EN) · Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza ·

    SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

    arXiv:2603.28730v2 Announce Type: replace-cross Abstract: Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL…

  385. arXiv cs.LG TIER_1 English(EN) · Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu ·

    Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

    arXiv:2605.26282v1 Announce Type: new Abstract: Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bia…

  386. arXiv cs.LG TIER_1 English(EN) · Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev ·

    Stochastic Decision Horizons for Constrained Reinforcement Learning

    arXiv:2602.04599v2 Announce Type: replace Abstract: We propose stochastic decision horizons (SDH), a theoretically grounded framework for solving constrained RL problems with every-step constraint satisfaction, a desirable property in many real-world applications. In SDH, a const…

  387. arXiv cs.LG TIER_1 English(EN) · Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi ·

    ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

    arXiv:2602.02192v5 Announce Type: replace Abstract: Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout executio…

  388. arXiv cs.LG TIER_1 English(EN) · Tingting Ni, Maryam Kamgarpour ·

    Constrained Meta Reinforcement Learning with Provable Test-Time Safety

    arXiv:2601.21845v2 Announce Type: replace Abstract: Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in im…

  389. arXiv cs.LG TIER_1 English(EN) · Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer ·

    CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

    arXiv:2502.03946v5 Announce Type: replace Abstract: Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data prep…

  390. arXiv cs.LG TIER_1 English(EN) · Dhruv S. Kushwaha, Zoleikha A. Biron ·

    Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

    arXiv:2605.26452v1 Announce Type: cross Abstract: Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a prin…

  391. arXiv cs.LG TIER_1 English(EN) · Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao ·

    Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

    arXiv:2605.26579v1 Announce Type: new Abstract: The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imb…

  392. arXiv cs.LG TIER_1 English(EN) · Barsat Khadka ·

    MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

    arXiv:2605.26343v1 Announce Type: new Abstract: Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new…

  393. arXiv cs.AI TIER_1 English(EN) · Yanfei Zhang, Xu Lin, Chenglin Wu ·

    StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

    arXiv:2605.27140v1 Announce Type: new Abstract: Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides…

  394. arXiv cs.AI TIER_1 English(EN) · Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua ·

    Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

    arXiv:2605.27209v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents of…

  395. arXiv cs.AI TIER_1 English(EN) · Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

    arXiv:2605.27355v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoin…

  396. arXiv cs.AI TIER_1 English(EN) · Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An ·

    Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

    arXiv:2605.26684v1 Announce Type: cross Abstract: Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies…

  397. arXiv cs.AI TIER_1 English(EN) · Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao ·

    Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

    arXiv:2605.26958v1 Announce Type: cross Abstract: Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scor…

  398. arXiv cs.AI TIER_1 English(EN) · Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas ·

    Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

    arXiv:2510.01833v2 Announce Type: replace Abstract: Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reas…

  399. arXiv cs.AI TIER_1 English(EN) · Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti ·

    Continual Model-Based Reinforcement Learning with Hypernetworks

    arXiv:2009.11997v3 Announce Type: replace-cross Abstract: Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be statio…

  400. Hugging Face Daily Papers TIER_1 English(EN) ·

    Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

    Skill0.5 is a novel agentic reinforcement learning framework that combines general skill internalization with task-specific skill utilization through a dynamic, difficulty-aware router to improve performance in complex task environments.

  401. Hugging Face Daily Papers TIER_1 English(EN) ·

    ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

    Proactive recommender systems using reinforcement learning face challenges with gradient estimation bias and variance, which are addressed through stepwise reward centering and position-specific advantage estimation mechanisms.

  402. Hugging Face Daily Papers TIER_1 English(EN) ·

    Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

    Reinforcement Learning from Verifiable Rewards and Multi-Token Prediction are combined through optimal coefficient calibration to improve joint training performance in mathematical reasoning benchmarks.

  403. arXiv cs.AI TIER_1 English(EN) · Kimin Lee ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

    Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, c…

  404. arXiv cs.AI TIER_1 English(EN) · Tat-Seng Chua ·

    Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

    Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in…

  405. arXiv cs.AI TIER_1 English(EN) · Chenglin Wu ·

    StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

    Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically t…

  406. arXiv cs.AI TIER_1 English(EN) · Jiaxin Mao ·

    Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

    Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrat…

  407. arXiv cs.CL TIER_1 English(EN) · Jie Jiang ·

    Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

    Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fa…

  408. Hugging Face Daily Papers TIER_1 English(EN) ·

    Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

    The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubr…

  409. arXiv cs.AI TIER_1 English(EN) · Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He ·

    Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

    arXiv:2602.10090v3 Announce Type: replace Abstract: Recent advances in large language model (LLM) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable e…

  410. arXiv cs.AI TIER_1 English(EN) · Pengyi Li, Jianye Hao, Hongyao Tang, Xian Fu, Yan Zheng, Ke Tang ·

    Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey on Hybrid Algorithms

    arXiv:2401.11963v5 Announce Type: replace-cross Abstract: Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization, has demonstrated remarkable performance advancements. By fusing both approaches, E…

  411. arXiv cs.AI TIER_1 English(EN) · Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang ·

    Coupled Variational Reinforcement Learning for Language Model General Reasoning

    arXiv:2512.12576v3 Announce Type: replace-cross Abstract: While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing t…

  412. arXiv cs.AI TIER_1 English(EN) · Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang ·

    Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

    arXiv:2602.08499v2 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and sho…

  413. arXiv cs.AI TIER_1 English(EN) · Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li ·

    STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

    arXiv:2602.15620v5 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain sta…

  414. arXiv cs.AI TIER_1 English(EN) · Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang ·

    Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

    arXiv:2603.18444v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often s…

  415. arXiv cs.AI TIER_1 English(EN) · Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo Wang, Linglin Liao ·

    Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

    arXiv:2604.17328v2 Announce Type: replace-cross Abstract: This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insuff…

  416. arXiv cs.CL TIER_1 English(EN) · Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang ·

    DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

    arXiv:2605.25604v1 Announce Type: new Abstract: Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal…

  417. arXiv cs.CL TIER_1 English(EN) · Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou ·

    Reinforcement Learning from Denoising Feedback

    arXiv:2605.25638v1 Announce Type: new Abstract: Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training para…

  418. arXiv cs.CL TIER_1 English(EN) · Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park ·

    Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

    arXiv:2605.25189v1 Announce Type: cross Abstract: Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and arg…

  419. arXiv cs.CL TIER_1 English(EN) · Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin ·

    When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

    arXiv:2605.25864v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for rew…

  420. arXiv cs.CL TIER_1 English(EN) · Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun ·

    CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

    arXiv:2602.02979v3 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through superv…

  421. arXiv cs.LG TIER_1 English(EN) · Meichen Song, Yuhao Wang, Enlu Zhou ·

    Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

    arXiv:2605.24345v1 Announce Type: new Abstract: In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this ti…

  422. arXiv cs.LG TIER_1 English(EN) · Noah Farr, Aryaman Reddi, Carlo D'Eramo, Jan Peters ·

    Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

    arXiv:2605.24709v1 Announce Type: new Abstract: Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents that process data incrementally, i.e. with a batch size of 1 and no replay buffer. While stream…

  423. arXiv cs.LG TIER_1 English(EN) · Amogh Palasamudram, Jakub Svoboda, Suguman Bansal, Krishnendu Chatterjee ·

    Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality

    arXiv:2605.24740v1 Announce Type: new Abstract: Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, t…

  424. arXiv cs.LG TIER_1 English(EN) · Zuyuan Zhang ·

    A Contractive Feedback Semantics for Reinforcement Learning

    arXiv:2605.24759v1 Announce Type: new Abstract: Discounted reinforcement learning is usually presented through Bellman equations on closed Markov decision processes. This paper develops a compositional view: a one-step decision process is treated as an open stochastic component, …

  425. arXiv cs.LG TIER_1 English(EN) · Zhongjian Qiao, Jiafei Lyu, Chenjia Bai, Peisong Wang, Siyang Gao, Shuang Qiu ·

    Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

    arXiv:2605.24862v1 Announce Type: new Abstract: Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source d…

  426. arXiv cs.LG TIER_1 English(EN) · Shruti Mishra, Michael Chang, Vamsi Spandan, Shmuel M. Rubinstein ·

    A perspective on fluid mechanical environments for challenges in reinforcement learning

    arXiv:2605.25011v1 Announce Type: new Abstract: We consider the challenge of developing agents that efficiently interact with high-dimensional, evolving environments, towards a view of practical reinforcement learning (RL) agents interacting with open worlds, of which they witnes…

  427. arXiv cs.LG TIER_1 English(EN) · Hyungkyu Kang, Byeongchan Kim, Min-hwan Oh ·

    Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.25740v1 Announce Type: new Abstract: Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remai…

  428. arXiv cs.LG TIER_1 English(EN) · Zhaoyu Zhu, Rui Gao, Shuang Li ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    arXiv:2605.26078v1 Announce Type: new Abstract: Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state…

  429. arXiv cs.LG TIER_1 English(EN) · Jayprakash S. Nair, Jimson Mathew, Shivashankar B. Nair ·

    A Reinforcement Learning Inspired Latent Yield Based Adaptive Algorithm Switching Mechanism

    arXiv:2605.24436v1 Announce Type: cross Abstract: Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performa…

  430. arXiv cs.LG TIER_1 English(EN) · Rei Higuchi, Ryotaro Kawata, Akifumi Wachi, Shokichi Takakura, Kohei Miyaguchi, Taiji Suzuki ·

    How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

    arXiv:2605.24749v1 Announce Type: cross Abstract: Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study t…

  431. arXiv cs.LG TIER_1 English(EN) · Jingyi Li, Peng Wu, Chengchun Shi ·

    Counterfactually Safe Reinforcement Learning

    arXiv:2605.25114v1 Announce Type: cross Abstract: Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety conc…

  432. arXiv cs.LG TIER_1 English(EN) · Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski ·

    Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

    arXiv:2510.07257v2 Announce Type: replace Abstract: Offline goal-conditioned reinforcement learning (GCRL) often struggles with long-horizon tasks, where errors in value estimation accumulate and produce unreliable policies. It is typically assumed that effective long-term planni…

  433. arXiv cs.AI TIER_1 English(EN) · Lei Ding, Bin He, Chenguang Wang, Yang Liu ·

    ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

    arXiv:2605.24900v1 Announce Type: new Abstract: Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instru…

  434. arXiv cs.LG TIER_1 English(EN) · Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai ·

    GeMPO: Generalized Measure Matching for Online Diffusion Reinforcement Learning

    arXiv:2603.10250v2 Announce Type: replace Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over samples from the behavior policy, which often induces an overgreedy policy and fails to utilize feedback from negative samples. In …

  435. arXiv cs.AI TIER_1 English(EN) · Chengwei Li, Junlin Liu, Yang Gao ·

    Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

    arXiv:2605.25091v1 Announce Type: new Abstract: As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state …

  436. arXiv cs.AI TIER_1 English(EN) · Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu ·

    IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

    arXiv:2605.23997v1 Announce Type: cross Abstract: Multimodal large language models via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visua…

  437. arXiv cs.AI TIER_1 English(EN) · Changling Li, Ying Li ·

    Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

    arXiv:2605.24992v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone netw…

  438. arXiv cs.AI TIER_1 English(EN) · Sohaib Lafifi ·

    Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

    arXiv:2605.25235v1 Announce Type: cross Abstract: We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility mo…

  439. arXiv cs.AI TIER_1 English(EN) · Minjae Kwon, Amir Moeini, Shangtong Zhang, Lu Feng ·

    Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

    arXiv:2605.25267v1 Announce Type: cross Abstract: Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pret…

  440. arXiv cs.AI TIER_1 English(EN) · Aleksandar Todorov, Matthia Sabatelli ·

    Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

    arXiv:2605.26012v1 Announce Type: cross Abstract: Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we presen…

  441. arXiv cs.AI TIER_1 English(EN) · In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Jin-Ha Noh, Julian Togelius, Kyung-Joong Kim ·

    PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

    arXiv:2502.10906v2 Announce Type: replace Abstract: Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and cont…

  442. arXiv cs.AI TIER_1 English(EN) · Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, S… ·

    Agent Learning via Early Experience

    arXiv:2510.08558v3 Announce Type: replace Abstract: A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning re…

  443. arXiv cs.AI TIER_1 English(EN) · Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng ·

    Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

    arXiv:2605.24423v1 Announce Type: new Abstract: In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To ri…

  444. arXiv cs.AI TIER_1 English(EN) · Lirong Che, Yuzhe Yang, Peiwen Lin, Chuang Wang, Xueqian Wang, Jian Su ·

    DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

    arXiv:2605.24539v1 Announce Type: new Abstract: Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can …

  445. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

    NoisyAgent is an agentic training framework that incorporates environmental imperfections into agent learning to improve robustness in real-world stochastic settings.

  446. Hugging Face Daily Papers TIER_1 English(EN) ·

    Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

    Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling.

  447. Hugging Face Daily Papers TIER_1 English(EN) ·

    Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

    AKBE enhances LLM agent training by dynamically identifying when tools are needed versus when internal knowledge suffices, improving accuracy and reducing unnecessary tool usage through targeted supervisory signals.

  448. arXiv cs.LG TIER_1 English(EN) · Shuang Li ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the…

  449. Hugging Face Daily Papers TIER_1 English(EN) ·

    Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

    Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the…

  450. arXiv cs.AI TIER_1 English(EN) · Matthia Sabatelli ·

    Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

    Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prio…

  451. arXiv cs.LG TIER_1 English(EN) · Guojun Yin ·

    When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

    Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often…

  452. arXiv cs.LG TIER_1 English(EN) · Min-hwan Oh ·

    Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning

    Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erron…

  453. arXiv cs.CL TIER_1 English(EN) · Baojian Zhou ·

    Reinforcement Learning from Denoising Feedback

    Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollo…

  454. arXiv cs.CL TIER_1 English(EN) · Yuewei Zhang ·

    DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

    Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world …

  455. arXiv cs.AI TIER_1 English(EN) · Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama ·

    Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

    arXiv:2510.00915v4 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably…

  456. arXiv cs.AI TIER_1 English(EN) · Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto ·

    The Surprising Difficulty of Search in Model-Based Reinforcement Learning

    arXiv:2601.21306v2 Announce Type: replace-cross Abstract: This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, s…

  457. arXiv cs.AI TIER_1 English(EN) · Chenglin Li, Grant Ruan, Hua Geng ·

    Safe Reinforcement Learning with Preference-based Constraint Inference

    arXiv:2603.23565v2 Announce Type: replace-cross Abstract: Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constra…

  458. arXiv cs.CL TIER_1 English(EN) · Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang ·

    From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

    arXiv:2605.23382v1 Announce Type: new Abstract: Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different plann…

  459. arXiv cs.CL TIER_1 English(EN) · Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu ·

    ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

    arXiv:2605.23454v1 Announce Type: new Abstract: Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches…

  460. arXiv cs.LG TIER_1 English(EN) · Zitian Li, Wang Chi Cheung ·

    Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

    arXiv:2605.23182v1 Announce Type: new Abstract: Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a "good enou…

  461. arXiv cs.AI TIER_1 English(EN) · Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport ·

    Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

    arXiv:2605.23146v1 Announce Type: cross Abstract: Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate…

  462. arXiv cs.AI TIER_1 English(EN) · Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu ·

    Curriculum reinforcement learning with measurable task representation learning

    arXiv:2605.23372v1 Announce Type: cross Abstract: In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challe…

  463. arXiv cs.AI TIER_1 English(EN) · Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu ·

    Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

    arXiv:2605.23415v1 Announce Type: cross Abstract: Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction …

  464. arXiv cs.AI TIER_1 English(EN) · Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster ·

    Goal-Conditioned Agents that Learn Everything All at Once

    arXiv:2605.23551v1 Announce Type: cross Abstract: A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goa…

  465. arXiv cs.AI TIER_1 English(EN) · Elie Abboud, Oren Gal ·

    ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

    arXiv:2605.23562v1 Announce Type: cross Abstract: Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in t…

  466. arXiv cs.AI TIER_1 English(EN) · Jason Ross Brown, Edward James Young ·

    Understanding Goal Generalisation in Sequential Reinforcement Learning

    arXiv:2605.23565v1 Announce Type: cross Abstract: Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on…

  467. arXiv cs.AI TIER_1 English(EN) · Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li ·

    R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

    arXiv:2601.03715v2 Announce Type: replace-cross Abstract: Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks…

  468. Hugging Face Daily Papers TIER_1 English(EN) ·

    DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

    Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.

  469. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Ying Li ·

    Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

    Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone networks, researchers have also applied MARL to addres…

  470. Hugging Face Daily Papers TIER_1 English(EN) ·

    Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

    Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.

  471. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Shivashankar B. Nair ·

    A Reinforcement Learning Inspired Latent Yield Based Adaptive Algorithm Switching Mechanism

    Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performance metrics can result in a reactive and unstable …

  472. arXiv cs.AI TIER_1 English(EN) · Edward James Young ·

    Understanding Goal Generalisation in Sequential Reinforcement Learning

    Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for a…

  473. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Oren Gal ·

    ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

    Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strate…

  474. arXiv cs.AI TIER_1 English(EN) · Jakob Foerster ·

    Goal-Conditioned Agents that Learn Everything All at Once

    A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is us…

  475. arXiv cs.CL TIER_1 English(EN) · Dayiheng Liu ·

    ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

    Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manual…

  476. arXiv cs.AI TIER_1 English(EN) · Yanhua Yu ·

    Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

    Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotat…

  477. arXiv cs.CL TIER_1 English(EN) · Chao Wang ·

    From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

    Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across use…

  478. arXiv cs.AI TIER_1 English(EN) · Peng Liu ·

    Curriculum reinforcement learning with measurable task representation learning

    In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on …

  479. arXiv cs.LG TIER_1 English(EN) · D. Sorokin, A. Kostin, L. Savchenko, G. Gusev, A. V. Savchenko ·

    TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

    arXiv:2306.05905v2 Announce Type: replace Abstract: A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by…

  480. arXiv cs.LG TIER_1 English(EN) · Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada ·

    Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

    arXiv:2602.10894v2 Announce Type: replace Abstract: Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularizati…

  481. arXiv cs.LG TIER_1 English(EN) · Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban ·

    Heterogeneous Agent Collaborative Reinforcement Learning

    arXiv:2603.02604v2 Announce Type: replace Abstract: We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. …

  482. arXiv cs.LG TIER_1 English(EN) · Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao ·

    SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

    arXiv:2602.11210v4 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur s…

  483. arXiv cs.LG TIER_1 English(EN) · Rupak Majumdar, Nikhil Singh, Sadegh Soudjani ·

    Kernel-Based Safe Exploration in Deep Reinforcement Learning

    arXiv:2605.22207v1 Announce Type: cross Abstract: Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a barrier function…

  484. arXiv cs.LG TIER_1 English(EN) · Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster ·

    Abstraction for Offline Goal-Conditioned Reinforcement Learning

    arXiv:2605.22711v1 Announce Type: new Abstract: Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been…

  485. arXiv cs.LG TIER_1 English(EN) · Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee ·

    Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

    arXiv:2605.22454v1 Announce Type: new Abstract: Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due t…

  486. arXiv cs.LG TIER_1 English(EN) · Wei Liu, Ting Long ·

    Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

    arXiv:2605.22376v1 Announce Type: new Abstract: Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by me…

  487. arXiv cs.LG TIER_1 English(EN) · Stefan Huber, Hannes Unger, Georg Schäfer, Jakob Rehrl ·

    Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

    arXiv:2605.22305v1 Announce Type: new Abstract: We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simpl…

  488. arXiv cs.LG TIER_1 English(EN) · Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt ·

    Hierarchical Variational Policies for Reward-Guided Diffusion

    arXiv:2605.21661v1 Announce Type: new Abstract: Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples…

  489. arXiv cs.CL TIER_1 English(EN) · Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi ·

    General Preference Reinforcement Learning

    arXiv:2605.18721v3 Announce Type: replace-cross Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a prog…

  490. arXiv cs.AI TIER_1 English(EN) · Xingwei Gan, Ying Zhu ·

    Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

    arXiv:2605.20555v1 Announce Type: cross Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning…

  491. arXiv cs.AI TIER_1 English(EN) · Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter ·

    Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

    arXiv:2605.20740v1 Announce Type: cross Abstract: Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point est…

  492. arXiv cs.AI TIER_1 English(EN) · Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton ·

    Behavior-Consistent Deep Reinforcement Learning

    arXiv:2605.21214v2 Announce Type: cross Abstract: Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run…

  493. arXiv cs.AI TIER_1 English(EN) · Xiaocan Li, Shiliang Wu, Zheng Shen ·

    Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

    arXiv:2605.20402v1 Announce Type: cross Abstract: MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error…

  494. arXiv cs.AI TIER_1 English(EN) · Yonghyeon Jo, Sunwoo Lee, Seungyul Han ·

    Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

    arXiv:2602.17062v2 Announce Type: replace Abstract: Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts du…

  495. arXiv cs.AI TIER_1 English(EN) · Nasehatul Mustakim, Lucas Lehnert ·

    Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

    arXiv:2605.20272v1 Announce Type: cross Abstract: While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Dis…

  496. arXiv cs.AI TIER_1 English(EN) · Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu ·

    FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

    arXiv:2605.20256v1 Announce Type: cross Abstract: Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy up…

  497. arXiv cs.AI TIER_1 English(EN) · Andrew Choi, Wei Xu ·

    RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    arXiv:2605.11151v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces wi…

  498. arXiv cs.AI TIER_1 English(EN) · Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han ·

    Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

    arXiv:2502.03752v5 Announce Type: replace-cross Abstract: Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable s…

  499. arXiv cs.AI TIER_1 English(EN) · Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han ·

    Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

    arXiv:2506.21039v3 Announce Type: replace-cross Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solution…

  500. arXiv cs.AI TIER_1 English(EN) · Carlo Romeo, Andrew D. Bagdanov ·

    ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

    arXiv:2605.19503v2 Announce Type: replace-cross Abstract: Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, how…

  501. arXiv cs.CL TIER_1 English(EN) · Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

    arXiv:2605.22074v1 Announce Type: cross Abstract: Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit a…

  502. arXiv cs.CL TIER_1 English(EN) · Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao ·

    Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

    arXiv:2605.22177v1 Announce Type: cross Abstract: The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with th…

  503. arXiv cs.AI TIER_1 English(EN) · Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    arXiv:2605.20865v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local…

  504. arXiv cs.AI TIER_1 English(EN) · Jakob Foerster ·

    Abstraction for Offline Goal-Conditioned Reinforcement Learning

    Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal ab…

  505. arXiv cs.AI TIER_1 English(EN) · Minwoo Lee ·

    Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

    Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic…

  506. arXiv cs.CL TIER_1 English(EN) · Jianhua Tao ·

    Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

    The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottlene…

  507. arXiv cs.CL TIER_1 English(EN) · Gao Huang ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

    Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed at…

  508. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

    SCRL addresses inefficiencies in reinforcement learning from verifiable rewards by using subproblem-level normalization for finer credit assignment and curriculum learning, improving mathematical reasoning performance on challenging benchmarks.

  509. Hugging Face Daily Papers TIER_1 English(EN) ·

    Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

    A reinforcement learning-driven orchestration framework dynamically composes expert models and skills for multimodal tasks, achieving superior performance with low computational overhead.

  510. arXiv cs.CL TIER_1 English(EN) · Yankai Lin ·

    DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understo…

  511. arXiv cs.AI TIER_1 English(EN) · Eric Eaton ·

    Behavior-Consistent Deep Reinforcement Learning

    Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of b…

  512. arXiv cs.LG TIER_1 English(EN) · Mira Mezini ·

    Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

    Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance in robotics, where code generation is increasingly being used for planning and executing actions, awarene…

  513. Hugging Face Daily Papers TIER_1 English(EN) ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL) - beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data …

  514. Hugging Face Daily Papers TIER_1 English(EN) ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient object…

  515. arXiv cs.AI TIER_1 English(EN) · Min-hwan Oh ·

    Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

    Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient object…

  516. arXiv cs.AI TIER_1 English(EN) · Alan Ritter ·

    Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

    Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive dist…

  517. Hugging Face Daily Papers TIER_1 English(EN) ·

    DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    Reinforcement learning from verifiable rewards is enhanced through a discriminative token credit assignment method that improves reward-based training by amplifying distinctive token-gradient directions and reducing noise from shared patterns.

  518. Hugging Face Daily Papers TIER_1 English(EN) ·

    Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

    MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct …

  519. arXiv cs.LG TIER_1 English(EN) · Julie Josse ·

    Set-Valued Policy Learning

    Conventional treatment policies map patient covariates to a single recommended intervention in order to maximize expected clinical outcomes. Although a rich body of causal inference methods has been developed to estimate such policies, point-valued recommendations can be highly s…

  520. arXiv cs.CL TIER_1 English(EN) · Han Li ·

    GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

    We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths,…

  521. Hugging Face Daily Papers TIER_1 English(EN) ·

    ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

    Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-t…

  522. Hugging Face Daily Papers TIER_1 English(EN) ·

    When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

    Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than g…

  523. Hugging Face Daily Papers TIER_1 English(EN) ·

    ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.

  524. arXiv cs.CL TIER_1 English(EN) · John M. Cioffi ·

    General Preference Reinforcement Learning

    Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, whil…

  525. Hugging Face Daily Papers TIER_1 English(EN) ·

    General Preference Reinforcement Learning

    Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, whil…

  526. arXiv cs.AI TIER_1 English(EN) · Zhiyu Chen ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollout…

  527. Hugging Face Daily Papers TIER_1 English(EN) ·

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollout…

  528. arXiv cs.AI TIER_1 English(EN) · Hendrik Baier ·

    Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

    Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architecture…

  529. Hugging Face Daily Papers TIER_1 English(EN) ·

    Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

    Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architecture…

  530. arXiv cs.AI TIER_1 English(EN) · Mark Fuge ·

    Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

    Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into th…

  531. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Seungyul Han ·

    LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

    Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-A…

  532. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Seungyul Han ·

    Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

    Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when …

  533. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jie Lu ·

    Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

    Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods r…

  534. arXiv cs.LG TIER_1 English(EN) · Liang Zheng ·

    BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control

    Real-world control systems frequently operate under piecewise stationary conditions, where dynamics remain stable for extended periods before undergoing abrupt regime changes. Standard robust RL methods face a fundamental dilemma: a globally conservative policy wastes perf…

  535. arXiv cs.CL TIER_1 English(EN) · José A. R. Fonallosa ·

    Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

    Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply…

  536. arXiv cs.LG TIER_1 English(EN) · Zihan Zhang ·

    Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

    We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^…

  537. arXiv cs.CL TIER_1 English(EN) · Zhouxing Shi ·

    GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

    Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (…

  538. arXiv cs.AI TIER_1 English(EN) · Yongliang Shen ·

    Self-Distilled Agentic Reinforcement Learning

    Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level gui…

  539. Hugging Face Daily Papers TIER_1 English(EN) ·

    Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where corre…

  540. arXiv cs.AI TIER_1 English(EN) · Yu-Xiong Wang ·

    Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where corre…

  541. arXiv cs.LG TIER_1 English(EN) · Min-hwan Oh ·

    Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning

    We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q(λ) (CPQL). Our algorithm adapts the Peng's Q(λ) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, th…

  542. arXiv cs.CL TIER_1 English(EN) · Qitian Wu ·

    Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

    Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to unifor…

  543. arXiv cs.CL TIER_1 English(EN) · Yaojie Lu ·

    Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optim…

  544. Hugging Face Daily Papers TIER_1 English(EN) ·

    ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

    Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely o…

  545. arXiv cs.CL TIER_1 English(EN) · Xunliang Cai ·

    Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we …

  546. arXiv cs.AI TIER_1 English(EN) · Ahmed Khalifa ·

    Learning Local Constraints for Reinforcement-Learned Content Generators

    Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforce…

  547. arXiv cs.AI TIER_1 English(EN) · Arnu Pretorius ·

    Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

    Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL…

  548. arXiv cs.AI TIER_1 English(EN) · Minjoon Seo ·

    Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization…

  549. Hugging Face Daily Papers TIER_1 English(EN) ·

    Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

    Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly whe…

  550. Hugging Face Daily Papers TIER_1 English(EN) ·

    ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

    Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and canno…

  551. Hugging Face Daily Papers TIER_1 English(EN) ·

    Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

    Inverse reinforcement learning (IRL), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision-making behavior. Many variants of IRL have been developed to capture complexities of human decision-making, such as subjective belie…

  552. arXiv cs.AI TIER_1 English(EN) · Yunzhong He ·

    Reward Hacking in Rubric-Based Reinforcement Learning

    Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verif…

  553. arXiv cs.LG TIER_1 English(EN) · Amanda Prorok ·

    Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

    Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve, and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind f…

  554. arXiv cs.AI TIER_1 English(EN) · Alexander J. Smola ·

    Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

    Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sam…

  555. arXiv cs.AI TIER_1 English(EN) · Peizhong Ju ·

    Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is its…

  556. arXiv cs.AI TIER_1 English(EN) · Shaowu Yang ·

    Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce th…

  557. arXiv cs.LG TIER_1 English(EN) · Shaowu Yang ·

    Delay-Empowered Causal Hierarchical Reinforcement Learning

    Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their…

  558. arXiv cs.AI TIER_1 English(EN) · Abhishek Gupta ·

    TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

    Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a uni…

  559. arXiv cs.LG TIER_1 English(EN) · Jamison Heard ·

    Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

    Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the envir…

  560. arXiv cs.LG TIER_1 English(EN) · Guillaume Drion ·

    On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

    In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observa…

  561. arXiv cs.CL TIER_1 English(EN) · Fuli Feng ·

    SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

    Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent m…

  562. arXiv cs.CL TIER_1 English(EN) · Xiangxiang Chu ·

    Learning Agentic Policy from Action Guidance

    Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional traini…

  563. arXiv cs.CL TIER_1 English(EN) · Xuanjing Huang ·

    Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level me…

  564. arXiv cs.CL TIER_1 English(EN) · Hong Cheng ·

    Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance…

  565. arXiv cs.AI TIER_1 English(EN) · Nicholas Bambos ·

    Policy Gradient Methods for Non-Markovian Reinforcement Learning

    We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provid…

  566. Hugging Face Daily Papers TIER_1 English(EN) ·

    Policy Gradient Methods for Non-Markovian Reinforcement Learning

    We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provid…

  567. arXiv cs.LG TIER_1 English(EN) · Jan Peters ·

    XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

    For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with spars…

  568. Hugging Face Daily Papers TIER_1 English(EN) ·

    Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

    In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a…

  569. arXiv cs.AI TIER_1 English(EN) · Nils Jansen ·

    Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

    In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a…

  570. arXiv cs.AI TIER_1 English(EN) · Michal Nauman ·

    When Does Non-Uniform Replay Matter in Reinforcement Learning?

    Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed…

  571. 量子位 (QbitAI) TIER_1 中文(ZH) · 闻乐 ·

    Reinforcement Learning Without Parameter Updates! OpenAI's Jia-Yi Ong Proposes a New Paradigm: Decision-Making Only Requires an AI-Handcrafted .py File

    The implementation is open source and reproducible.

  572. arXiv cs.LG TIER_1 English(EN) · Sanjay Bhat ·

    Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

    Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied i…

  573. arXiv cs.LG TIER_1 English(EN) · Daniel Murfet ·

    Interpreting Reinforcement Learning Agents with Susceptibilities

    Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate …

  574. arXiv cs.AI TIER_1 Deutsch(DE) · Minhyuk Sung ·

    Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probabi…

  575. arXiv cs.CL TIER_1 English(EN) · Yohan Jo ·

    Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its em…

  576. arXiv cs.LG TIER_1 English(EN) · Hao Chen ·

    LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

    Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supe…

  577. arXiv cs.CL TIER_1 English(EN) · Miaohui Wang ·

    ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

    Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penal…

  578. arXiv cs.CL TIER_1 English(EN) · Yanghua Xiao ·

    SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training …

  579. arXiv cs.LG TIER_1 English(EN) · Shangtong Zhang ·

    Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

    In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the …

  580. arXiv cs.CL TIER_1 English(EN) · Stefano Soatto ·

    Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Wo…

  581. arXiv cs.LG TIER_1 English(EN) · Tim Walter, Hannah Markgraf, Jonathan Külz, Matthias Althoff ·

    Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

    arXiv:2506.01665v4 · The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These saf…

  582. arXiv cs.LG TIER_1 English(EN) · David Leeftink, Max Hinne, Marcel van Gerven ·

    Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

    arXiv:2605.05373v1 · A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement le…

  583. arXiv cs.LG TIER_1 English(EN) · Dillon Sandhu, Ronald Parr ·

    Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

    arXiv:2605.05481v1 · We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is u…

  584. arXiv cs.LG TIER_1 English(EN) · Nandiraju Gireesh, Yuanliang Ju, He Wang ·

    Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    arXiv:2605.05544v1 · Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal:…

  585. arXiv cs.LG TIER_1 English(EN) · Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu ·

    Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark

    arXiv:2605.06066v1 · Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium …

  586. arXiv cs.LG TIER_1 English(EN) · Alireza Modirshanechi, Benjamin Eysenbach, Peter Dayan, Eric Schulz ·

    Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

    arXiv:2605.06145v1 · Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information s…

  587. arXiv cs.LG TIER_1 English(EN) · Yaomin Wang, Jianting Pan, Ran Tian, Xiaoyang Li, Yu Zhang, Hengle Qin, Tianshu YU ·

    AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

    arXiv:2605.06149v1 · The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is …

  588. arXiv cs.LG TIER_1 English(EN) · Hyunjun Na, Donghwan Lee ·

    Soft Deterministic Policy Gradient with Gaussian Smoothing

    arXiv:2605.06228v1 · Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in pra…

  589. arXiv cs.LG TIER_1 English(EN) · Zuyuan Zhang, Fei Xu Yu, Tian Lan ·

    Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    arXiv:2605.06500v1 · Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve …

  590. arXiv cs.LG TIER_1 English(EN) · Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua ·

    On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    arXiv:2605.06523v1 · Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated…

  591. arXiv cs.LG TIER_1 English(EN) · Dmitri Goloubentsev, Natalija Karpichina ·

    SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

    arXiv:2605.06570v1 · Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor …

  592. arXiv cs.LG TIER_1 English(EN) · Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song ·

    Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

    arXiv:2605.05262v1 · We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnos…

  593. arXiv cs.LG TIER_1 English(EN) · Haodong Liang, Lifeng Lai ·

    Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

    arXiv:2605.05755v1 · We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-at…

  594. arXiv cs.LG TIER_1 English(EN) · Maria Ana Cardei, Matthew Landers, Afsaneh Doryab ·

    Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.06557v1 · Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, pa…

  595. arXiv cs.LG TIER_1 English(EN) · David Müller, Agon Serifi, Sammy Christen, Ruben Grandia, Espen Knoop, Moritz Bächer ·

    ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

    arXiv:2605.06593v1 · Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motio…

  596. arXiv cs.LG TIER_1 English(EN) · Shuo Liu, Xinzichen Li, Christopher Amato ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    arXiv:2605.06595v1 · Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal…

  597. arXiv cs.LG TIER_1 English(EN) · Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Yanfeng Wang, Siheng Chen ·

    AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

    arXiv:2602.07906v5 · Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behaviora…

  598. arXiv cs.LG TIER_1 English(EN) · Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li ·

    Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

    arXiv:2603.15646v2 · Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, m…

  599. arXiv cs.LG TIER_1 English(EN) · Jiaxin Liu, Anzhe Cheng, Paul Bogdan ·

    Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

    arXiv:2603.18257v2 · When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observ…

  600. arXiv cs.LG TIER_1 English(EN) · Naveen Mysore ·

    Prediction-Based Markov Violation Scores for Detecting Non-Markovian Observations in Reinforcement Learning

    arXiv:2603.27389v2 · Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance …

  601. arXiv cs.LG TIER_1 English(EN) · Yuan Zhuang, Yuexin Bian, Sihong He, Jie Feng, Qing Su, Songyang Han, Jonathan Petit, Shihao Ji, Yuanyuan Shi, Fei Miao ·

    Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

    arXiv:2604.18978v2 · Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training…

  602. arXiv cs.CL TIER_1 English(EN) · Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen ·

    Milestone-Guided Policy Learning for Long-Horizon Language Agents

    arXiv:2605.06078v1 · While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where corr…

  603. arXiv cs.CL TIER_1 English(EN) · Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang ·

    A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

    arXiv:2605.06200v1 · Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions.…

  604. arXiv cs.CL TIER_1 English(EN) · Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin ·

    StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    arXiv:2605.06642v1 · Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and…

  605. arXiv cs.CL TIER_1 English(EN) · Mingwei Xu, Hao Fang ·

    Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    arXiv:2605.06650v1 · Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change …

  606. arXiv cs.AI TIER_1 English(EN) · Yinbo Yu, Xueyu Yin, Jiadai Wang, Chunwei Tian, Sai Xu, Qi Zhu, Daoqiang Zhang ·

    BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

    arXiv:2605.05977v1 · Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger pattern…

  607. arXiv cs.AI TIER_1 English(EN) · Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang ·

    Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    arXiv:2605.06130v1 · A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, a…

  608. arXiv cs.AI TIER_1 English(EN) · Haochen Cai, Xian Yu ·

    Learning to Cut: Reinforcement Learning for Benders Decomposition

    arXiv:2605.06516v1 · Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem…

  609. arXiv cs.AI TIER_1 English(EN) · Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar ·

    Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

    arXiv:2510.01857v4 · Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or definin…

  610. arXiv cs.CL TIER_1 English(EN) · Hao Fang ·

    Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change from the Proximal Policy Optimization (PPO) to G…

  611. arXiv cs.AI TIER_1 English(EN) · Zhenfei Yin ·

    StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. I…

  612. Hugging Face Daily Papers TIER_1 English(EN) ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substan…

  613. arXiv cs.AI TIER_1 English(EN) · Christopher Amato ·

    Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substan…

  614. arXiv cs.LG TIER_1 English(EN) · Moritz Bächer ·

    ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

    Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We…

  615. arXiv cs.LG TIER_1 English(EN) · Natalija Karpichina ·

    SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

    Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instance…

  616. arXiv cs.AI TIER_1 English(EN) · Afsaneh Doryab ·

    Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

    Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and jo…

  617. arXiv cs.AI TIER_1 English(EN) · Tat-Seng Chua ·

    On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-…

  618. arXiv cs.AI TIER_1 English(EN) · Xian Yu ·

    Learning to Cut: Reinforcement Learning for Benders Decomposition

    Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this …

  619. arXiv cs.AI TIER_1 English(EN) · Tian Lan ·

    Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on spec…

  620. arXiv cs.CL TIER_1 English(EN) · Jie Jiang ·

    A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

    Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assi…

  621. arXiv cs.CL TIER_1 English(EN) · Yongliang Shen ·

    Milestone-Guided Policy Learning for Long-Horizon Language Agents

    While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal …

  622. arXiv cs.AI TIER_1 English(EN) · Karthik Soma, Yann Bouteiller, Heiko Hamann, Giovanni Beltrame ·

    The Hive Mind is a Single Reinforcement Learning Agent

    arXiv:2410.17517v5 · Decision-making is an essential attribute of any intelligent agent or group. Natural systems are known to converge to effective strategies through at least two distinct mechanisms: collective decision-making via imitation …

  623. arXiv cs.LG TIER_1 English(EN) · Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang ·

    EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

    arXiv:2605.04960v1 · Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity …

  624. arXiv cs.LG TIER_1 English(EN) · Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, Miroslav Pajic, Yuichi Motai ·

    Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    arXiv:2605.05123v1 · In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline,…

  625. arXiv cs.LG TIER_1 English(EN) · Shawn Ray ·

    Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

    arXiv:2605.05020v1 · System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-S…

  626. arXiv cs.LG TIER_1 English(EN) · Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir, Shivaram Kalyanakrishnan ·

    On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

    arXiv:2605.04979v1 · A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of de…

  627. arXiv cs.LG TIER_1 English(EN) · Xiyan Fu, Wei Liu ·

    Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    arXiv:2605.04920v1 · Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target …

  628. arXiv cs.LG TIER_1 English(EN) · Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka ·

    A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

    arXiv:2605.04880v1 · Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular i…

  629. arXiv cs.LG TIER_1 English(EN) · Lirui Luo, Guoxi Zhang, Hongming Xu, Cong Fang, Qing Li ·

    SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

    arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes o…

  630. arXiv cs.LG TIER_1 English(EN) · Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama ·

    Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

    arXiv:2605.04477v1 Announce Type: new Abstract: Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in t…

  631. arXiv cs.LG TIER_1 English(EN) · Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao, Hao Cheng, Sifa Zheng ·

    CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

    arXiv:2605.04470v1 Announce Type: new Abstract: Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade…

  632. arXiv cs.LG TIER_1 English(EN) · Senne Deproost, Mehrdad Asadi, Ann Nowé ·

    Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

    arXiv:2605.04254v1 Announce Type: new Abstract: We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with…

  633. arXiv cs.LG TIER_1 English(EN) · Qijun Liao, Zhaoxin Yu, Jue Yang ·

    Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing

    arXiv:2605.04185v1 Announce Type: new Abstract: When deploying reinforcement learning policies to physical robots, actuator rate constraints -- hard limits on how fast each joint can move per control step -- are unavoidable. These limits vary substantially across joints due to di…

  634. arXiv cs.LG TIER_1 English(EN) · Bilel Abderrahmane Benziane, Benoit Lardeux, Ayoub Mcharek, Maher Jridi ·

    Designing a double deep reinforcement learning selection tool for resilient demand prediction

    arXiv:2605.04068v1 Announce Type: new Abstract: The use of artificial intelligence in supply chain forecasting has attracted many scientific studies for several decades. However, the process of selecting an appropriate forecasting solution becomes a daunting task. This complexity…

  635. arXiv cs.CL TIER_1 English(EN) · Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang ·

    Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning

    arXiv:2512.15146v4 Announce Type: replace Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for impr…

  636. arXiv cs.LG TIER_1 English(EN) · Björn Hoppmann, Christoph Scholz ·

    Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent

    arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning over…

  637. arXiv cs.LG TIER_1 English(EN) · Xiaoyuan Cheng, Wenxuan Yuan, Boyang Li, Yuanchao Xu, Yiming Yang, Hao Liang, Bei Peng, Robert Loftin, Zhuo Sun, Yukun Hu ·

    How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

    arXiv:2602.02924v2 Announce Type: replace Abstract: Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline set…

  638. arXiv cs.LG TIER_1 English(EN) · Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang ·

    On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

    arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs …

  639. arXiv cs.LG TIER_1 English(EN) · Peter N. Loxley ·

    Optimal Control with Natural Images: Efficient Reinforcement Learning using Overcomplete Sparse Codes

    arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this…

  640. arXiv cs.AI TIER_1 English(EN) · Thomas Weng ·

    When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously l…

  641. arXiv cs.AI TIER_1 English(EN) · Yuichi Motai ·

    Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are …

  642. arXiv cs.AI TIER_1 English(EN) · Gabriel Nelson ·

    LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

    Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framewor…

  643. arXiv cs.LG TIER_1 English(EN) · Shawn Ray ·

    Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

    System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-SND, which replaces this complete-graph average w…

  644. arXiv cs.AI TIER_1 English(EN) · Shivaram Kalyanakrishnan ·

    On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

    A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect rec…

  645. arXiv cs.AI TIER_1 English(EN) · Zhisheng Yang ·

    EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

    Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, …

  646. arXiv cs.AI TIER_1 English(EN) · Gal A. Kaminka ·

    Modular Reinforcement Learning For Cooperative Swarms

    A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement l…

  647. arXiv cs.CL TIER_1 English(EN) · Wei Liu ·

    Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fail…

  648. arXiv cs.AI TIER_1 English(EN) · Gal A. Kaminka ·

    A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

    Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastical…

  649. arXiv cs.LG TIER_1 English(EN) · Yuxin Bai, Aranyak Acharyya, Ashwin De Silva, Zeyu Shen, James Hassett, Joshua T. Vogelstein ·

    Optimal control of the future via prospective learning with control

    arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the …

  650. arXiv cs.LG TIER_1 English(EN) · Shan Yang, Yang Liu ·

    Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

    arXiv:2602.20078v3 Announce Type: replace-cross Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on …

  651. arXiv cs.LG TIER_1 English(EN) · Cyrille Kone, Kevin Jamieson ·

    Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

    arXiv:2605.03921v1 Announce Type: new Abstract: We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer fr…

  652. arXiv cs.AI TIER_1 English(EN) · Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun ·

    T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    arXiv:2605.02178v1 Announce Type: new Abstract: Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and …

  653. arXiv cs.LG TIER_1 English(EN) · Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang ·

    Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including…

  654. arXiv cs.AI TIER_1 English(EN) · Dahyun Oh, Minhyuk Yoon, H. Jin Kim ·

    Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2605.01865v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation…

  655. arXiv cs.LG TIER_1 English(EN) · Prakhar Gupta, Vaibhav Gupta ·

    Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

    arXiv:2512.04277v3 Announce Type: replace Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during …

  656. arXiv cs.LG TIER_1 English(EN) · Jingchu Gai, Laixi Shi ·

    Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation

    arXiv:2605.03125v1 Announce Type: new Abstract: Multi-agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the …

  657. arXiv cs.LG TIER_1 English(EN) · Kevin Jamieson ·

    Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

    We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to im…

  658. arXiv cs.CL TIER_1 English(EN) · Mehmet Iscan ·

    Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

    arXiv:2605.01567v1 Announce Type: cross Abstract: Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic re…

  659. arXiv cs.CL TIER_1 English(EN) · Yifan Zhang, Lanser Contributors ·

    Reinforcement Learning from Compiler and Language Server Feedback

    arXiv:2510.22907v2 Announce Type: replace Abstract: Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers…

  660. arXiv cs.AI TIER_1 English(EN) · Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu ·

    AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only re…

  661. arXiv cs.LG TIER_1 English(EN) · Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang, Stefano V. Albrecht ·

    Optimistic $\epsilon$-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

    arXiv:2502.03506v2 Announce Type: replace-cross Abstract: The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and …

  662. arXiv cs.LG TIER_1 English(EN) · Jongsoo Lee, Jangwon Kim, Soohee Han ·

    Delayed homomorphic reinforcement learning for environments with delayed feedback

    arXiv:2604.03641v2 Announce Type: replace Abstract: Reinforcement learning in real-world systems often involves delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical augmentation-based approaches cause state-space explosion, which i…

  663. arXiv cs.LG TIER_1 English(EN) · Kejiang Qian, Amos Storkey, Fengxiang He ·

    Rationality Measurement and Theory for Reinforcement Learning Agents

    arXiv:2602.04737v2 Announce Type: replace Abstract: This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it …

  664. arXiv cs.LG TIER_1 English(EN) · Lipeng Zu, Yu Qian, Shayok Chakraborty, Xiaonan Zhang ·

    From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Release for Offline-to-Online Reinforcement Learning

    arXiv:2511.03828v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves dur…

  665. arXiv cs.LG TIER_1 English(EN) · Juan Sebastian Rojas, Chi-Guhn Lee ·

    Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning

    arXiv:2510.02945v3 Announce Type: replace Abstract: Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance…

  666. arXiv cs.LG TIER_1 English(EN) · Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    arXiv:2605.02528v1 Announce Type: cross Abstract: Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. Wh…

  667. arXiv cs.LG TIER_1 English(EN) · Yiheng Zhang, Yiming Wang, Kaiyan Zhao, Zhenglin Wan, Jiayu Chen, Leong Hou U ·

    ANO: A Principled Approach to Robust Policy Optimization

    arXiv:2605.02320v1 Announce Type: cross Abstract: Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clippin…

  668. arXiv cs.LG TIER_1 English(EN) · Haohan Yu, Jinmiao Cong, Shengzhi Wang, Lu Wang, Chanjuan Liu ·

    MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning

    arXiv:2605.01805v1 Announce Type: cross Abstract: A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals necessitates the ability to quantify the true, long-term ca…

  669. arXiv cs.LG TIER_1 English(EN) · Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li ·

    Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    arXiv:2605.01327v1 Announce Type: cross Abstract: Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the na…

  670. arXiv cs.LG TIER_1 English(EN) · Marc Dymetman ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    arXiv:2605.02375v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improve…

  671. arXiv cs.LG TIER_1 English(EN) · Sanjiv R. Das, Harshad Khadilkar, Sukrit Mittal, Daniel Ostrov, Deep Srivastav, Hungjen Wang ·

    A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    arXiv:2605.02300v1 Announce Type: new Abstract: Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) …

  672. arXiv cs.LG TIER_1 English(EN) · Ujjwal Patil, Javad Ghofrani ·

    Combining Trained Models in Reinforcement Learning

    arXiv:2605.02159v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) has delivered strong results in domains such as Atari and Go, but it still suffers from high sample cost and weak transfer beyond the training setting. A common response is to reuse information from…

  673. arXiv cs.LG TIER_1 English(EN) · Rudray Dave, Vedang Dubey, Smit Deoghare, Sudhakar Mishra ·

    Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

    arXiv:2605.01823v1 Announce Type: new Abstract: Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for augmenting the math reasoning skills of Large Language Models (LLMs) based on a single instance. Current state-…

  674. arXiv cs.CL TIER_1 English(EN) · Seonglae Cho, Zekun Wu, Adriano Koshiyama ·

    Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

    arXiv:2602.10437v3 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Rei…

  675. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable dive…

  676. arXiv cs.LG TIER_1 English(EN) · Peter Detzner ·

    Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

    Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable dive…

  677. Hugging Face Daily Papers TIER_1 English(EN) ·

    Middle-mile logistics through the lens of goal-conditioned reinforcement learning

    Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs f…

  678. arXiv cs.LG TIER_1 English(EN) · Marc Dymetman ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes fal…

  679. Hugging Face Daily Papers TIER_1 English(EN) ·

    Binary Rewards and Reinforcement Learning: Fundamental Challenges

    Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes fal…

  680. arXiv cs.LG TIER_1 English(EN) · Leong Hou U ·

    ANO: A Principled Approach to Robust Policy Optimization

    Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gr…

  681. arXiv cs.LG TIER_1 English(EN) · Hungjen Wang ·

    A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multiple …

  682. arXiv cs.LG TIER_1 English(EN) · Guangyu Zhao, Kewei Lian, Haoxuan Ru, Borong Zhang, Haowei Lin, Zhancun Mu, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang ·

    Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

    arXiv:2412.02125v2 Announce Type: replace-cross Abstract: Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass …

  683. arXiv cs.LG TIER_1 English(EN) · Jiaming Zhang, Yujie Yang, Yao Lyu, Shengbo Eben Li, Liping Zhang ·

    Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

    arXiv:2605.00667v1 Announce Type: new Abstract: Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires …

  684. arXiv cs.LG TIER_1 English(EN) · Washim Uddin Mondal, Vaneet Aggarwal ·

    Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

    arXiv:2408.11513v2 Announce Type: replace Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entrop…

  685. arXiv cs.LG TIER_1 English(EN) · Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon ·

    Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

    arXiv:2512.04341v3 Announce Type: replace Abstract: Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bay…

  686. arXiv cs.CL TIER_1 English(EN) · Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng ·

    Reinforcement Learning for LLM Post-Training: A Survey

    arXiv:2407.16216v3 Announce Type: replace Abstract: Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training…

  687. arXiv cs.LG TIER_1 English(EN) · Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rong Luo, Jing Gao ·

    PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

    arXiv:2510.26020v2 Announce Type: replace-cross Abstract: Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents from outcome-only rewards …

  688. arXiv cs.LG TIER_1 English(EN) · Yikai Wang, Shang Liu, Jose Blanchet ·

    Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    arXiv:2605.00155v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations researc…

  689. arXiv cs.LG TIER_1 English(EN) · Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin ·

    Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    arXiv:2605.00347v1 Announce Type: new Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-…

  690. arXiv cs.LG TIER_1 English(EN) · Haichen Hu, Jian Qian, David Simchi-Levi ·

    Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation

    arXiv:2605.00393v1 Announce Type: new Abstract: Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. Whi…

  691. arXiv cs.LG TIER_1 English(EN) · Tao Li, Kaiyuan Hou, Tuan Vinh, Monika Raj, Zhichun Guo, Carl Yang ·

    Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

    arXiv:2604.07669v2 Announce Type: replace Abstract: Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enf…

  692. arXiv cs.LG TIER_1 English(EN) · Preston Rozwood, Edward Mehrez, Ludger Paehler, Wen Sun, Steven L. Brunton ·

    Koopman-Assisted Reinforcement Learning

    arXiv:2403.02290v2 Announce Type: replace-cross Abstract: The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman equation, are ubiquitous in reinforcement learning and control theory. However, these equations become intractable for high-dimensional or nonlinear…

  693. arXiv cs.LG TIER_1 English(EN) · Andrzej Ruszczynski, Tiangang Zhang ·

    Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

    arXiv:2605.00654v1 Announce Type: new Abstract: For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the …

  694. arXiv cs.LG TIER_1 English(EN) · Anamika Lochab, Bolian Li, Ruqi Zhang ·

    Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    arXiv:2605.00365v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collaps…

  695. Hugging Face Daily Papers TIER_1 English(EN) ·

    T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervas…

  696. arXiv cs.AI TIER_1 English(EN) · Liping Zhang ·

    Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

    Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessita…

  697. arXiv cs.AI TIER_1 English(EN) · Jianmin Wu ·

    AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to indi…

  698. arXiv cs.LG TIER_1 English(EN) · David Simchi-Levi ·

    Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation

    Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-…

  699. arXiv cs.AI TIER_1 English(EN) · Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati ·

    FP-IRL: Fokker–Planck Inverse Reinforcement Learning – A Physics-Constrained Approach to Markov Decision Processes

    arXiv:2306.10407v3 Announce Type: replace-cross Abstract: Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision pro…

  700. arXiv cs.AI TIER_1 English(EN) · Alexandros Evangelidis, Gricel Vázquez, Simos Gerasimou ·

    Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement

    arXiv:2506.17792v2 Announce Type: replace Abstract: Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional pol…

  701. arXiv cs.AI TIER_1 English(EN) · Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun ·

    Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

    arXiv:2603.09117v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in inco…

  702. arXiv cs.AI TIER_1 English(EN) · Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause ·

    Bounded Ratio Reinforcement Learning

    arXiv:2604.18578v3 Announce Type: replace-cross Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect betwee…

  703. arXiv cs.LG TIER_1 English(EN) · Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang ·

    Co-Evolving Policy Distillation

    arXiv:2604.27083v1 Announce Type: new Abstract: RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mi…

  704. arXiv cs.AI TIER_1 English(EN) · Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin ·

    PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

    arXiv:2604.28123v1 Announce Type: cross Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distrib…

  705. arXiv cs.LG TIER_1 English(EN) · Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko ·

    Bayesian policy gradient and actor-critic algorithms

    arXiv:2604.27563v1 Announce Type: new Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, …

  706. arXiv cs.LG TIER_1 English(EN) · Haiyang Zhao ·

    Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    arXiv:2604.27411v1 Announce Type: new Abstract: Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift has occurred is often the easier …

  707. arXiv cs.LG TIER_1 English(EN) · Buqing Ou, Frederike Dümbgen ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    arXiv:2604.27667v1 Announce Type: cross Abstract: Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good perfo…

  708. arXiv cs.LG TIER_1 English(EN) · Eason Yu, Tzu Hao Liu, Clément L. Canonne, Yunke Wang, Chang Xu, Nguyen H. Tran, Stefano V. Albrecht ·

    NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

    arXiv:2510.18183v2 Announce Type: replace Abstract: Finding Nash equilibria in two-player zero-sum imperfect-information games remains a central challenge in multi-agent reinforcement learning. Recent multi-round regularization methods offer a promising direction, yet existing ap…

  709. arXiv cs.AI TIER_1 English(EN) · Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn ·

    EXPO: Stable Reinforcement Learning with Expressive Policies

    arXiv:2507.07986v3 Announce Type: replace-cross Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable …

  710. arXiv cs.CL TIER_1 English(EN) · Chi Jin ·

    Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human traj…

  711. arXiv cs.LG TIER_1 English(EN) · Frederike Dümbgen ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initializatio…

  712. Hugging Face Daily Papers TIER_1 English(EN) ·

    Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

    Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initializatio…

  713. arXiv cs.LG TIER_1 English(EN) · Michal Valko ·

    Bayesian policy gradient and actor-critic algorithms

    Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, which tend to have high variance, requiring many…

  714. arXiv cs.CL TIER_1 English(EN) · Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson, Kenny Smith ·

    Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

    arXiv:2602.21720v2 Announce Type: replace Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learn…

  715. arXiv cs.CL TIER_1 English(EN) · Xia Zeng, Yihan Chen, Luhui Liu, Chao Luo, Ye Chen, Zhuoran Zhuang ·

    Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

    arXiv:2510.04214v3 Announce Type: replace Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guar…

  716. arXiv cs.CL TIER_1 English(EN) · Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang ·

    Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

    arXiv:2509.16591v2 Announce Type: replace Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regu…

  717. arXiv cs.LG TIER_1 English(EN) · Ankita Kushwaha, Kiran Ravish, Preeti Lamba, Pawan Kumar ·

    A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

    arXiv:2505.17342v2 Announce Type: replace Abstract: Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview o…

  718. arXiv cs.AI TIER_1 English(EN) · Seungyub Han, Hyungjin Kim, Jungwoo Lee ·

    Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

    arXiv:2604.26516v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-b…

  719. arXiv cs.LG TIER_1 English(EN) · Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan ·

    Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

    arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered…

  720. arXiv cs.AI TIER_1 English(EN) · Jungwoo Lee ·

    Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

    Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation i…

  721. arXiv cs.LG TIER_1 English(EN) · Ihor Vitenko, Noha Ibrahim, Sihem Amer-Yahia ·

    Lever: Inference-Time Policy Reuse under Support Constraints

    arXiv:2604.20174v2 Announce Type: replace Abstract: Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new compo…

  722. arXiv cs.LG TIER_1 English(EN) · Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia ·

    Reinforcement Learning Using known Invariances

    arXiv:2511.03473v2 Announce Type: replace Abstract: In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorpor…

  723. arXiv cs.LG TIER_1 English(EN) · Artur Eisele, Bernd Frauenknecht, Friedrich Solowjow, Sebastian Trimpe ·

    Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

    arXiv:2604.25508v1 Announce Type: new Abstract: Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dy…

  724. arXiv cs.LG TIER_1 English(EN) · Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    arXiv:2604.25898v1 Announce Type: new Abstract: Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise …

  725. arXiv cs.LG TIER_1 English(EN) · Ali Al Housseini, Cristina Rottondi, Omran Ayoub ·

    Hierarchical Reinforcement Learning for the Dynamic VNE with Alternatives Problem

    arXiv:2512.05207v2 Announce Type: replace-cross Abstract: Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduc…

  726. arXiv cs.LG TIER_1 English(EN) · Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban ·

    Policy Improvement Reinforcement Learning

    arXiv:2604.00860v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize p…

  727. arXiv cs.AI TIER_1 English(EN) · Roberto Corizzo ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live enviro…

  728. Hugging Face Daily Papers TIER_1 English(EN) ·

    TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live enviro…

  729. arXiv cs.AI TIER_1 English(EN) · Daniele Meli ·

    Sample-efficient Neuro-symbolic Proximal Policy Optimization

    Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers pa…

  730. arXiv cs.LG TIER_1 English(EN) · Sebastian Trimpe ·

    Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

    Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented R…

  731. arXiv cs.AI TIER_1 English(EN) · Karol Desnos ·

    Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

    Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (…

  732. arXiv cs.LG TIER_1 English(EN) · Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li ·

    SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    arXiv:2604.24729v1 Announce Type: new Abstract: Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promis…

  733. arXiv cs.CL TIER_1 English(EN) · Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li ·

    Reinforcement Learning with Backtracking Feedback

    arXiv:2602.08377v2 Announce Type: replace-cross Abstract: Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). Th…

  734. arXiv cs.LG TIER_1 English(EN) · Stela Tong, Elai Ben-Gal ·

    CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

    arXiv:2604.22785v1 Announce Type: new Abstract: Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal…

  735. arXiv cs.LG TIER_1 English(EN) · Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi ·

    When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    arXiv:2604.22873v1 Announce Type: new Abstract: Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or gove…

  736. arXiv cs.LG TIER_1 English(EN) · Zixuan Xia, Quanxi Li ·

    K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

    arXiv:2604.23056v1 Announce Type: new Abstract: We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recur…

  737. arXiv cs.LG TIER_1 English(EN) · Rahul Narava, Siddharth Verma, Ojas Jain, Shashi Shekhar Jha, Mayank Shekhar Jha ·

    CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

    arXiv:2604.23576v1 Announce Type: new Abstract: Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to s…

  738. arXiv cs.LG TIER_1 English(EN) · Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng ·

    TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

    arXiv:2604.24005v1 Announce Type: new Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent se…

  739. arXiv cs.LG TIER_1 English(EN) · Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Seyyid Osman Sevgili, Ümit Can Bekar ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    arXiv:2604.24338v1 Announce Type: new Abstract: This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A m…

  740. arXiv cs.LG TIER_1 English(EN) · Ying-Tu Chen, Wei Hung, Bing-Shu Wu, Zhang-Wei Hong, Ping-Chun Hsieh ·

    A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

    arXiv:2604.24532v1 Announce Type: new Abstract: Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach ad…

  741. arXiv cs.CL TIER_1 English(EN) · Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao ·

    DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

    arXiv:2604.24320v1 Announce Type: new Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental und…

  742. arXiv cs.LG TIER_1 English(EN) · Shipeng Li, Zhiqin Yang, Shikun Li, Xiaobo Xia, Hengyu Liu, Xinghua Zhang, Gaode Chen, Dong Fang, Ying Tai, Zhe Peng ·

    LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment

    arXiv:2506.11480v4 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we p…

  743. arXiv cs.LG TIER_1 English(EN) · Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh ·

    Polychromic Objectives for Reinforcement Learning

    arXiv:2509.25424v5 Announce Type: replace Abstract: Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising b…

  744. arXiv cs.AI TIER_1 English(EN) · Donghwan Lee ·

    Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

    arXiv:2604.17457v3 Announce Type: replace-cross Abstract: Q-value iteration (Q-VI) is usually analyzed through the γ-contraction of the Bellman operator. This argument proves convergence to Q*, but it gives only a coarse account of when the induced greedy policy bec…

  745. arXiv cs.LG TIER_1 English(EN) · Wenchao Li ·

    SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across …

  746. arXiv cs.LG TIER_1 English(EN) · Ping-Chun Hsieh ·

    A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

    Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network…

  747. Hugging Face Daily Papers TIER_1 English(EN) ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulat…

  748. arXiv cs.LG TIER_1 English(EN) · Ümit Can Bekar ·

    Perfecting Aircraft Maneuvers with Reinforcement Learning

    This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulat…

  749. arXiv cs.CL TIER_1 English(EN) · Shen Gao ·

    DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

    Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single …

  750. arXiv cs.CL TIER_1 English(EN) · Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu ·

    UR²: Unify RAG and Reasoning through Reinforcement Learning

    arXiv:2508.06165v4 Announce Type: replace Abstract: Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex …

  751. arXiv cs.LG TIER_1 English(EN) · Anne E. Staples ·

    Insect-inspired modular architectures as inductive biases for reinforcement learning

    arXiv:2604.22081v1 Announce Type: new Abstract: Most reinforcement-learning (RL) controllers used in continuous control are architecturally centralized: observations are compressed into a single latent state from which both value estimates and actions are produced. Biological con…

  752. arXiv cs.LG TIER_1 English(EN) · Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu ·

    ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

    arXiv:2604.22169v1 Announce Type: new Abstract: Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at al…

  753. arXiv cs.LG TIER_1 English(EN) · Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang ·

    Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    arXiv:2604.22229v1 Announce Type: new Abstract: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset…

  754. arXiv cs.LG TIER_1 English(EN) · Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng ·

    SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

    arXiv:2604.22558v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GU…

  755. arXiv cs.LG TIER_1 English(EN) · Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava ·

    Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

    arXiv:2512.20831v2 Announce Type: replace-cross Abstract: Real-world sequential decision-making often involves parameterized action spaces that require both, decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed.…

  756. arXiv cs.LG TIER_1 English(EN) · Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor ·

    AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning

    arXiv:2511.14135v2 Announce Type: replace Abstract: Fair workload enforcement in heterogeneous multi-agent systems that pursue shared objectives remains challenging. Fixed fairness penalties often introduce inefficiencies, training instability, and conflicting agent incentives. R…

  757. arXiv cs.AI TIER_1 English(EN) · Lingfang Zeng ·

    SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

    As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilem…

  758. arXiv cs.AI TIER_1 English(EN) · Chi Zhang ·

    Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipe…

  759. arXiv cs.AI TIER_1 English(EN) · Yong Liu ·

    ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

    Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast lea…

  760. arXiv cs.LG TIER_1 English(EN) · Sukesh Subaharan ·

    Dynamical Priors as a Training Objective in Reinforcement Learning

    Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or …

  761. Hugging Face Daily Papers TIER_1 English(EN) ·

    Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in grou…

  762. X — Mira Murati TIER_1 English(EN) · Mira Murati ·

    Combining the benefits of RL and SFT with on-policy distillation, a promising approach for training small models for domain performance and continual ...

    Combining the benefits of RL and SFT with on-policy distillation, a promising approach for training small models for domain performance and continual learning. Thinking Machines: Our latest post explores on-policy distillation, a training appr…

  763. arXiv stat.ML TIER_1 English(EN) · Zhiheng Zhang ·

    Wasserstein Policy Learning for Distributional Outcomes

    Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential out…

  764. arXiv stat.ML TIER_1 English(EN) · Tengyang Xie ·

    When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first …

  765. arXiv cs.CV TIER_1 English(EN) · Mohamed Jismy Aashik Rasool, Shabir Ahmad, Gisong Oh, Teag Kuen Whangbo ·

    SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

    arXiv:2606.15243v1. Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Exist…

  766. arXiv cs.CV TIER_1 English(EN) · Shaivi Malik ·

    Reinforcement Learning for Neural Model Editing

    arXiv:2606.13461v1. Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formula…

  767. arXiv stat.ML TIER_1 English(EN) · Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin ·

    Learning with Simulators: No Regret in a Computationally Bounded World

    arXiv:2606.13576v1. Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, wh…

  768. arXiv cs.CV TIER_1 English(EN) · Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou ·

    From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

    arXiv:2507.22028v2. Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to …

  769. arXiv stat.ML TIER_1 English(EN) · Alexander Rakhlin ·

    Learning with Simulators: No Regret in a Computationally Bounded World

    Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far mo…

  770. arXiv stat.ML TIER_1 English(EN) · Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini ·

    Impact of Connectivity on Laplacian Representations in Reinforcement Learning

    arXiv:2603.08558v3. Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches l…

  771. arXiv cs.CV TIER_1 English(EN) · Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong ·

    ReMoT: Reinforcement Learning with Motion Contrast Triplets

    arXiv:2603.00461v3. We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integ…

  772. arXiv stat.ML TIER_1 English(EN) · Alexander Ryabchenko, Wenlong Mou ·

    Reinforcement Learning with Action-Triggered Observations

    arXiv:2510.02149v2. We introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, w…

  773. arXiv cs.CV TIER_1 English(EN) · Guillaume Henon-Just ·

    Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

    Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforc…

  774. arXiv stat.ML TIER_1 English(EN) · Haolin Liu, Braham Snyder, Chen-Yu Wei ·

    On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

    arXiv:2602.12107v2. We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited…

  775. arXiv stat.ML TIER_1 English(EN) · Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng ·

    ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning

    arXiv:2603.10823v2. Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving…

  776. arXiv stat.ML TIER_1 English(EN) · Thanh Nguyen-Tang, Raman Arora ·

    Exact Unlearning in Reinforcement Learning

    arXiv:2606.04182v1. We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output…

  777. arXiv stat.ML TIER_1 English(EN) · Harin Lee, Kevin Jamieson ·

    Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

    arXiv:2603.03480v2. We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper…

  778. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Exact Unlearning in Reinforcement Learning

    We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output after unlearning is indistinguishable from…

  779. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Minimax-Optimal Policy Regret in Partially Observable Markov Games

    arXiv:2606.02363v1. We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial o…

  780. arXiv stat.ML TIER_1 English(EN) · Imad Aouali, Otmane Sakhi ·

    Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

    arXiv:2509.03456v2. Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assumin…

  781. arXiv stat.ML TIER_1 English(EN) · Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan ·

    Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability

    arXiv:2510.03494v2. We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for eit…

  782. arXiv stat.ML TIER_1 English(EN) · Raman Arora ·

    Minimax-Optimal Policy Regret in Partially Observable Markov Games

    We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavi…

  783. arXiv stat.ML TIER_1 English(EN) · Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed, Michael Muehlebach ·

    Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

    arXiv:2605.31261v1. The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by cons…

  784. arXiv stat.ML TIER_1 English(EN) · Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi ·

    Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

    arXiv:2605.30503v1. Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state–goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies tha…

  785. arXiv stat.ML TIER_1 English(EN) · Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang ·

    Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

    arXiv:2605.31172v1. This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA i…

  786. arXiv stat.ML TIER_1 English(EN) · Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata ·

    PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

    arXiv:2510.10544v3. We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obt…

  787. LessWrong (AI tag) TIER_1 English(EN) · andyqhan ·

    How's it going? Reinforcement learning in language models recruits a functional welfare axis

    In collaboration with David Chalmers and Pavel Izmailov. Work done at NYU. Andy wrote this summary of the paper, which you can find in full on the website (https://functionalwelfare.com), or, if you …

  788. arXiv stat.ML TIER_1 English(EN) · Michael Muehlebach ·

    Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

    The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the …

  789. arXiv stat.ML TIER_1 English(EN) · Shangtong Zhang ·

    Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

    This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal dif…

  790. arXiv stat.ML TIER_1 English(EN) · Christoph Dann, Yishay Mansour, Mehryar Mohri ·

    Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

    arXiv:2605.29032v1. Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a real…

  791. arXiv stat.ML TIER_1 English(EN) · Dorival Leão, Alberto Ohashi, Simone Scotti, Adolfo M. D da Silva ·

    Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version

    arXiv:2604.13147v2. This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equat…

  792. arXiv stat.ML TIER_1 English(EN) · Ahmed H. Qureshi ·

    Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

    Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state–goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization…

  793. arXiv stat.ML TIER_1 English(EN) · Wonyoung Kim, Min-Hwan Oh, Garud Iyengar, Assaf Zeevi ·

    Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

    arXiv:2605.28364v1. Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-ca…

  794. arXiv stat.ML TIER_1 English(EN) · Sebastian Sanokowski, Kaustubh Patil ·

    Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

    arXiv:2512.02019v3. Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajecto…

  795. arXiv stat.ML TIER_1 English(EN) · Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus ·

    Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

    arXiv:2605.27834v1. We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are c…

  796. arXiv stat.ML TIER_1 English(EN) · Mohammadmahdi Ghasemloo, David J. Eckman, Yaxian Li ·

    Accelerating Reinforcement Learning Training Using Simulation Surrogate Models

    arXiv:2605.27556v1. High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output rela…

  797. arXiv stat.ML TIER_1 English(EN) · Mehryar Mohri ·

    Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

    Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but f…

  798. arXiv stat.ML TIER_1 English(EN) · Assaf Zeevi ·

    Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

    Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance…

  799. arXiv stat.ML TIER_1 English(EN) · Shengbo Wang, Jose Blanchet, Peter Glynn ·

    Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

    arXiv:2605.26361v1. Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We …

  800. arXiv stat.ML TIER_1 English(EN) · Nathan Kallus ·

    Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

    We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate…

  801. arXiv stat.ML TIER_1 English(EN) · Yaxian Li ·

    Accelerating Reinforcement Learning Training Using Simulation Surrogate Models

    High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output relationship. In parallel, reinforcement learning (R…

  802. arXiv stat.ML TIER_1 English(EN) · Peter Glynn ·

    Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

    Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We study value-based policy learning in stochastic op…

  803. arXiv stat.ML TIER_1 English(EN) · Chengchun Shi ·

    Counterfactually Safe Reinforcement Learning

    Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the noti…

  804. arXiv stat.ML TIER_1 English(EN) · Taiji Suzuki ·

    How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

    Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with…

  805. arXiv cs.CV TIER_1 English(EN) · Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing ·

    ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    arXiv:2605.20342v2. Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dis…

  806. arXiv stat.ML TIER_1 English(EN) · Jongchan Park ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    arXiv:2605.21557v1. Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation d…

  807. arXiv stat.ML TIER_1 English(EN) · Oliver Mortensen, Mohammad Sadegh Talebi ·

    On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

    arXiv:2605.21763v1. We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which…

  808. arXiv stat.ML TIER_1 English(EN) · Mohammad Sadegh Talebi ·

    On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

    We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic…

  809. arXiv stat.ML TIER_1 English(EN) · Jongchan Park ·

    Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data …

  810. arXiv stat.ML TIER_1 English(EN) · Zijun Chen, Zihan Zhang ·

    Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

    arXiv:2605.15692v1. We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret ag…

  811. arXiv stat.ML TIER_1 English(EN) · Maryam Kamgarpour ·

    Fast Rates for Inverse Reinforcement Learning

    We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimat…

  812. arXiv stat.ML TIER_1 English(EN) · Ian Osband ·

    Delightful Distributed Policy Gradient

    arXiv:2603.20521v2. Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising …

  813. arXiv stat.ML TIER_1 English(EN) · Tobias Schmähling, Matthias Burkhardt, Tobias Windisch ·

    Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    arXiv:2605.13401v1. We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajector…

  814. arXiv stat.ML TIER_1 English(EN) · Yash Kanoria ·

    Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

    We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it, a structure common in robotics, control, and operations problems. Standard mo…

  815. arXiv stat.ML TIER_1 English(EN) · Tobias Windisch ·

    Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation …

  816. arXiv stat.ML TIER_1 English(EN) · Maxime Haddouche, Otmane Sakhi ·

    Sequential Off-Policy Learning with Logarithmic Smoothing

    arXiv:2506.10664v2. Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, polici…

  817. arXiv stat.ML TIER_1 English(EN) · Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli ·

    DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees

    arXiv:2604.16684v2. We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise stationary (PS) setting,…

  818. arXiv stat.ML TIER_1 English(EN) · Nam Phuong Tran, Andi Nika, Goran Radanovic, Long Tran-Thanh, Debmalya Mandal ·

    Sparse Offline Reinforcement Learning with Corruption Robustness

    arXiv:2512.24768v3. We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse …

  819. arXiv stat.ML TIER_1 English(EN) · Aidan Gleich, Eric Laber, Alexander Volfovsky ·

    Adaptive Policy Learning Under Unknown Network Interference

    arXiv:2605.11191v1. Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in orde…

  820. arXiv stat.ML TIER_1 English(EN) · Seokmin Ko, Ambuj Tewari, Kihyuk Hong ·

    Offline Constrained Reinforcement Learning under Partial Data Coverage

    arXiv:2505.17506v2. We study offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. Prior methods either require full data coverage for evaluating intermediate policies, l…

  821. arXiv stat.ML TIER_1 English(EN) · Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao ·

    TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    arXiv:2605.11473v1. Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diag…

  822. arXiv stat.ML TIER_1 English(EN) · Rui Miao ·

    TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously ov…

  823. arXiv stat.ML TIER_1 English(EN) · Alexander Volfovsky ·

    Adaptive Policy Learning Under Unknown Network Interference

    Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in order to maximize a cumulative outcome of interest (…

  824. arXiv stat.ML TIER_1 English(EN) · Guannan Qu ·

    Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

    This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improv…

  825. arXiv stat.ML TIER_1 English(EN) · Zaiwei Chen ·

    Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

    In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in whi…

  826. arXiv stat.ML TIER_1 English(EN) · Lars van der Laan, Nathan Kallus, Aurelien Bibaut ·

    Inverse Reinforcement Learning with Just Classification and a Few Regressions

    arXiv:2509.21172v2. Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward–value pairs can rationalize the same actions. Meaningful reward recovery…

  827. arXiv stat.ML TIER_1 English(EN) · Xinyu Liu, Zixuan Xie, Shangtong Zhang ·

    Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

    arXiv:2605.07104v1. Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic app…

  828. arXiv stat.ML TIER_1 English(EN) · Yuyang Zhang, Haldun Balim, Na Li ·

    Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

    arXiv:2605.07101v1. Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), address…

  829. arXiv stat.ML TIER_1 English(EN) · Kun Long, Yuqiang Li, Xianyi Wu ·

    Improved Model-based Reinforcement Learning with Smooth Kernels

    arXiv:2605.07218v1. For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive …

  830. arXiv stat.ML TIER_1 English(EN) · Lars van der Laan, Nathan Kallus ·

    Bellman Calibration for $V$-Learning in Offline Reinforcement Learning

    arXiv:2512.23694v2. Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Be…

  831. LessWrong (AI tag) TIER_1 English(EN) · Oliver Sourbut ·

    Reinforcement learning scaling might incentivise hidden reasoning architectures for AI

    In short: the transformer architecture brought massive scale to AI, and also provided partial guarantees of ‘reasoning out loud’, an unprecedentedly interpretable situation for AI. Reinforcement learning (…

  832. arXiv stat.ML TIER_1 English(EN) · Feng Ji ·

    Reinforcement Learning Measurement Model

    Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action val…

  833. arXiv stat.ML TIER_1 English(EN) · Xianyi Wu ·

    Improved Model-based Reinforcement Learning with Smooth Kernels

    For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-bas…

  834. arXiv stat.ML TIER_1 English(EN) · Shangtong Zhang ·

    Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

    Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are c…

  835. arXiv stat.ML TIER_1 English(EN) · Na Li ·

    Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

    Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In pr…

  836. arXiv stat.ML TIER_1 English(EN) · Lifeng Lai ·

    Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

    We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement p…

  837. arXiv stat.ML TIER_1 English(EN) · Li Song ·

    Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

    We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bo…

  838. arXiv stat.ML TIER_1 English(EN) · Onno Eberhard, Thibaut Cuvelier, Michal Valko, Bruno De Backer ·

    Middle-mile logistics through the lens of goal-conditioned reinforcement learning

    arXiv:2605.02461v1. Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with …

  839. arXiv stat.ML TIER_1 English(EN) · Bruno De Backer ·

    Middle-mile logistics through the lens of goal-conditioned reinforcement learning

    Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs f…

  840. arXiv stat.ML TIER_1 English(EN) · Tiangang Zhang ·

    Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

    For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in…

  841. arXiv stat.ML TIER_1 English(EN) · Rohan Tangri, Jan-Peter Calliess ·

    Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk

    arXiv:2601.22993v3. We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empi…

  842. arXiv stat.ML TIER_1 English(EN) · Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang ·

    DDO-RM: Distribution-Level Policy Improvement after Reward Learning

    arXiv:2604.11119v2. Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate deci…

  843. arXiv stat.ML TIER_1 English(EN) · Ruqi Zhang ·

    Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degra…

  844. arXiv stat.ML TIER_1 English(EN) · Jose Blanchet ·

    Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem u…

  845. arXiv cs.CV TIER_1 English(EN) · Chengwei Qin ·

    PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

    The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's o…

  846. arXiv stat.ML TIER_1 English(EN) · Zhenghao Li, Shengbo Wang, Nian Si ·

    Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

    arXiv:2505.12202v3. Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conse…

  847. arXiv stat.ML TIER_1 English(EN) · Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin ·

    When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    arXiv:2604.25872v1. Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality o…

  848. arXiv stat.ML TIER_1 English(EN) · Noam Razin ·

    When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat i…

  849. arXiv stat.ML TIER_1 English(EN) · Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong ·

    CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

    arXiv:2604.23308v1. Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they ca…

  850. arXiv stat.ML TIER_1 English(EN) · Elliot Fosong ·

    CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

    Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introdu…

  851. Smol AINews TIER_1 English(EN) ·

    Prime Intellect's INTELLECT-2 and PRIME-RL advance distributed reinforcement learning

    Prime Intellect released INTELLECT-2, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. ByteDance launched DreamO, a unified image customization model on Hugging Face. Qwen released models opt…

  852. Smol AINews TIER_1 English(EN) ·

    PRIME: Process Reinforcement through Implicit Rewards

    Implicit Process Reward Models (PRIME) have been highlighted as a significant advancement in online reinforcement learning, trained on a 7B model with impressive results compared to gpt-4o. The approach builds on the importance of process reward models established by …

  853. Eugene Yan TIER_1 English(EN) ·

    Reinforcement Learning for Recommendations and Search

    Focusing on long-term rewards, exploration, and frequently updated items.

  854. Modal blog TIER_1 English(EN) ·

    Reinforcement learning is an infrastructure problem

    What we've seen helping teams run Reinforcement Learning at scale on Modal. Plus an open-source library to skip the scaffolding.

  855. Modal blog TIER_1 English(EN) ·

    Scaling Reinforcement Learning at Applied Compute

    How Applied Compute trains custom agents with Reinforcement Learning for enterprises like DoorDash, Cognition, and Mercor on Modal.

  856. AWS Machine Learning Blog TIER_1 English(EN) · Surya Kari ·

    Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

    In this post, you will learn how to implement reinforcement learning with verifiable rewards (RLVR) to introduce verification and transparency into reward signals to improve training performance. This approach works best when outputs can be objectively verified for correctness, s…

  857. Together AI blog TIER_1 English(EN) ·

    Together AI and Meta partner to bring PyTorch Reinforcement Learning to the AI Native Cloud

    Build, train, and deploy advanced AI agents with integrated reinforcement learning on the Together platform.

  858. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Exploring deep reinforcement learning

    In addition to being a Developer Advocate at Hugging Face, Thomas Simonini is building next-gen AI in games that can talk and have smart interactions with the player using Deep Reinforcement Learning (DRL) and Natural Language Processing (NLP). He also created a Deep Reinforce…

  859. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Reinforcement Learning for search

    Hamish from Sajari blows our mind with a great discussion about AI in search. In particular, he talks about Sajari’s quest for performant AI implementations and extensive use of Reinforcement Learning (RL). We’ve been wanting to make this one happen for a while, and it was wel…

  860. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Reinforcement learning for chip design

    Daniel and Chris have a fascinating discussion with Anna Goldie and Azalia Mirhoseini from Google Brain about the use of reinforcement learning for chip floor planning - or placement - in which many new designs are generated, and then evaluated, to find an optimal component la…

  861. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Deep Reinforcement Learning

    While attending the NVIDIA GPU Technology Conference in Silicon Valley, Chris met up with Adam Stooke, a speaker and PhD student at UC Berkeley who is doing groundbreaking work in large-scale deep reinforcement learning and robotics. Adam took Chris on a tour of deep reinforce…

  862. Lex Fridman Podcast TIER_1 English(EN) · Lex Fridman ·

    Leslie Kaelbling: Reinforcement Learning, Planning, and Robotics

    Leslie Kaelbling is a roboticist and professor at MIT. She is recognized for her work in reinforcement learning, planning, robot navigation, and several other topics in AI. She won the IJCAI Computers and Thought Award and was the editor-in-chief of the prestigious Journal of …

  863. Lex Fridman Podcast TIER_1 English(EN) · Lex Fridman ·

    Pieter Abbeel: Deep Reinforcement Learning

    Pieter Abbeel is a professor at UC Berkeley, director of the Berkeley Robot Learning Lab, and is one of the top researchers in the world working on how to make robots understand and interact with the world around them, especially through imitation and deep reinforcement learni…

  864. Medium — Claude tag TIER_1 English(EN) · Thirupathi Pavan Sai ·

    How machines learn: supervised, unsupervised & reinforcement learning

    https://medium.com/@thirupathipavansai/how-machines-learn-supervised-unsupervised-reinforcement-learning-2f8a5ae8961d?source=rss------claude-5

  865. Medium — Claude tag TIER_1 English(EN) · Abhishekrout ·

    The Secret Behind ChatGPT Bitter Lesson + Core Loop (RL-Reinforcement Learning)

    https://medium.com/@abhishekrout77/the-secret-behind-chatgpt-bitter-lesson-core-loop-rl-reinforcement-learning-40cf97d7104c?source=rss------claude-5

  866. Towards AI TIER_1 English(EN) · Deepanshu Gupta ·

    Reinforcement Learning: The Post-Training Engine Behind Reasoning Models

    Reinforcement learning used to feel like a branch of AI reserved for games, robotics, recommendation systems, and control. It was the world of agents, environments, rewards, policies, simulators, self-play, exploration, and long-horizon decisions. The defining question…

  867. Mastodon — mastodon.social TIER_1 日本語(JA) · ymbot ·

    Unraveling Agentic Reinforcement Learning in GPT-OSS: A Practical Retrospective

    https://huggingface.co/blog/LinkedIn/gpt-oss-agentic-rl *AI-generated auto-post (headline + link) #AI #GenerativeAI #LLM #AIGenerated

  868. Mastodon — mastodon.social TIER_1 English(EN) · jonathannnnn ·

    A look at how reinforcement learning can lead to “reward hacking,” where AI finds shortcuts to maximize rewards without truly achieving the intended goal.

    It highlights how reward design shapes AI behavior. #AI #MachineLearning #AIsafety Read more: https://solihullpublishing.…

  869. Mastodon — mastodon.social TIER_1 English(EN) · aihaberleri ·

    📰 2026 Breakthrough: OpenAI Eliminates Parameter Updates in Reinforcement Learning with Python Scripts

    A groundbreaking reinforcement learning paradigm developed by OpenAI researcher Jia-Yi Weng eliminates the need for parameter updates, enabling AI agents to make decisions by ge…

  870. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 New Learning Method: Reinforcement Learning Without Parameter Updates

    OpenAI researchers have presented a new reinforcement learning paradigm that lets AI make decisions on its own without updating parameters. The method has the AI learn by writing a .py file.…