New research advances policy optimization for robotics and LLMs

arXiv cs.LG TIER_1 English(EN) · Xin Guo, Yijie Huang, Xiang Yu · 2026-06-11 04:00

Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

arXiv:2606.11798v1 Announce Type: cross Abstract: In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman syst…

arXiv cs.LG TIER_1 English(EN) · Yifan Yang, Zhen Zhang, Jiayi Tian, Liyan Tan, Zheng Zhang · 2026-06-11 04:00

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-c…

arXiv cs.AI TIER_1 English(EN) · Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada · 2026-06-11 04:00

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

arXiv:2505.03296v2 Announce Type: replace-cross Abstract: We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstration…

arXiv cs.AI TIER_1 English(EN) · Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu · 2026-06-11 04:00

APPO: Agentic Procedural Policy Optimization

arXiv:2606.12384v1 Announce Type: cross Abstract: Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such…

arXiv cs.AI TIER_1 English(EN) · Xiangxiang Chu · 2026-06-10 17:47

APPO: Agentic Procedural Policy Optimization

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, makin…

arXiv cs.LG TIER_1 English(EN) · Xiang Yu · 2026-06-10 08:31

Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent probl…

arXiv cs.AI TIER_1 English(EN) · Carlos S. Sepúlveda, Gonzalo A. Ruz · 2026-06-10 04:00

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

arXiv:2606.10321v1 Announce Type: cross Abstract: Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of …

arXiv cs.AI TIER_1 English(EN) · Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen · 2026-06-10 04:00

SocraticPO: Policy Optimization via Interactive Guidance

arXiv:2606.09887v1 Announce Type: cross Abstract: Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revis…

arXiv cs.LG TIER_1 English(EN) · Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud · 2026-06-10 04:00

MODIP: Efficient Model-Based Optimization for Diffusion Policies

arXiv:2606.10825v1 Announce Type: new Abstract: Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC,…

arXiv cs.CL TIER_1 English(EN) · Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu · 2026-06-10 04:00

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

arXiv:2606.10768v1 Announce Type: cross Abstract: The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-…

arXiv cs.AI TIER_1 English(EN) · Octave Oliviers, Glenn Vinnicombe · 2026-06-10 04:00

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

arXiv:2606.10580v1 Announce Type: cross Abstract: The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees conve…

arXiv cs.AI TIER_1 English(EN) · Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian · 2026-06-10 04:00

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

arXiv:2606.09961v1 Announce Type: cross Abstract: Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the traject…

arXiv cs.LG TIER_1 English(EN) · Olivier Sigaud · 2026-06-09 13:09

MODIP: Efficient Model-Based Optimization for Diffusion Policies

Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning …

arXiv cs.CL TIER_1 English(EN) · Linchao Zhu · 2026-06-09 12:21

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories…

arXiv cs.AI TIER_1 English(EN) · Glenn Vinnicombe · 2026-06-09 08:45

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canoni…

arXiv cs.AI TIER_1 English(EN) · Yutong Song, Jiang Wu, Pengfei Zhang, Wenjun Huang, Honghui Xu, Nikil Dutt, Amir M. Rahmani · 2026-06-09 04:00

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

arXiv:2606.08875v1 Announce Type: new Abstract: Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is esp…

arXiv cs.LG TIER_1 English(EN) · Ayush Singh, Umang Goyal, Ankur Dahiya · 2026-06-09 04:00

CATPO: Critique-Augmented Tree Policy Optimization

arXiv:2606.08346v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampli…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

N-GRPO, a novel exploration strategy within the GRPO framework, enhances mathematical reasoning in large language models through semantic neighbor mixing that maintains semantic consistency while injecting diversity.

arXiv cs.LG TIER_1 English(EN) · Chenyu Yang, Denis Tarasov, Davide Liconti, Romain Guntz, Hehui Zheng, Robert K. Katzschmann · 2026-06-08 04:00

SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

arXiv:2602.09580v4 Announce Type: replace-cross Abstract: Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not per…

arXiv cs.LG TIER_1 English(EN) · Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco · 2026-06-08 04:00

Robotic Policy Adaptation via Weight-Space Meta-Learning

arXiv:2606.07217v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to…

arXiv cs.LG TIER_1 English(EN) · Ke Hu, Shutong Ding, Panxin Tao, Jingya Wang, Ye Shi · 2026-06-08 04:00

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

arXiv:2606.06967v1 Announce Type: new Abstract: Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because…

arXiv cs.CL TIER_1 English(EN) · Ankur Dahiya · 2026-06-06 21:29

CATPO: Critique-Augmented Tree Policy Optimization

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, …

arXiv cs.AI TIER_1 English(EN) · Soichiro Nishimori, Paavo Parmas · 2026-06-06 04:00

Retry Policy Gradients in Continuous Action Spaces

arXiv:2606.05888v1 Announce Type: new Abstract: Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete acti…
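
For readers unfamiliar with the retry-based objectives named in this abstract, the following is a minimal sketch of how pass@K and max@K are commonly estimated from a finite batch of samples. It is not taken from the paper: the function names `pass_at_k` and `max_at_k` are hypothetical, and the estimators shown are the standard combinatorial ones (the best of K samples drawn without replacement from n), assumed here for illustration only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct.

    Probability that at least one of k samples, drawn without
    replacement from the n available, is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k(rewards, k):
    """Estimate max@k: the expected best reward among k of the n samples.

    Assumes 1 <= k <= len(rewards). The sample at 0-based rank i in
    ascending order is the maximum of exactly comb(i, k - 1) subsets.
    """
    n = len(rewards)
    ordered = sorted(rewards)
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)
```

With k = 1 the second function reduces to the plain mean of the rewards, which is the usual expected-return objective.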

arXiv cs.LG TIER_1 English(EN) · Luca Franco · 2026-06-05 12:29

Robotic Policy Adaptation via Weight-Space Meta-Learning

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific …

arXiv cs.CL TIER_1 English(EN) · Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo · 2026-06-05 04:00

On Advantage Estimates for Max@K Policy Gradients

arXiv:2606.06080v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pas…

arXiv cs.LG TIER_1 English(EN) · Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Yanjie Li, Baigui Sun, Yang Liu · 2026-06-05 04:00

GIPO: Gaussian Importance Sampling Policy Optimization

arXiv:2603.03955v2 Announce Type: replace Abstract: Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where in…

arXiv cs.LG TIER_1 English(EN) · Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko · 2026-06-05 04:00

Soft Sequence Policy Optimization

arXiv:2602.19327v3 Announce Type: replace Abstract: A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i…

arXiv cs.LG TIER_1 English(EN) · Powei Chang, Jinpeng Zhang, Chaoqun Sun, MiniWell Tsao, Lianrui Li, Jianxiang Xiang, Chenyu Wang, Yukang Gao, Dongying Kong · 2026-06-05 04:00

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

arXiv:2606.05800v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts …

arXiv cs.CL TIER_1 English(EN) · Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo · 2026-06-05 04:00

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

arXiv:2606.06096v1 Announce Type: cross Abstract: Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of…

arXiv cs.CL TIER_1 English(EN) · Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti · 2026-06-05 04:00

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

arXiv:2606.06058v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

Robotic Policy Adaptation via Weight-Space Meta-Learning

WIZARD is a weight-space meta-learning framework that generates task-specific LoRA parameters for frozen VLA policies using language instructions and demonstration videos, enabling efficient task adaptation without fine-tuning.

arXiv cs.AI TIER_1 English(EN) · Yutaka Matsuo · 2026-06-04 12:34

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient …

arXiv cs.CL TIER_1 English(EN) · Yutaka Matsuo · 2026-06-04 12:16

On Advantage Estimates for Max@K Policy Gradients

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradie…

arXiv cs.CL TIER_1 English(EN) · Mohammad Javad Dousti · 2026-06-04 11:58

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. …
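
The instability described here can be illustrated with a small sketch of the group-relative advantage normalization commonly associated with GRPO. This is an assumption-laden illustration, not the paper's method: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all choices made here for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A homogeneous group (all rollouts pass, or all fail) carries no signal:
homogeneous = group_relative_advantages([1, 1, 1, 1])

# A mixed group produces advantages of opposite sign:
mixed = group_relative_advantages([0, 1])
```

In the homogeneous case every advantage is zero, so the prompt contributes nothing to the update. This is one way to read the abstract's remark about within-group reward distributions that are frequently homogeneous.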

arXiv cs.LG TIER_1 English(EN) · Tanya Veeravalli, David M. Bossens, Atsushi Nitanda · 2026-06-04 04:00

Policy Gradient for Continuous-Time Robust Markov Decision Processes

arXiv:2606.04335v1 Announce Type: new Abstract: The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics…

arXiv cs.LG TIER_1 English(EN) · Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli · 2026-06-04 04:00

Reusing Trajectories in Policy Gradients Enables Fast Convergence

arXiv:2506.06178v3 Announce Type: replace Abstract: Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring…

arXiv cs.AI TIER_1 English(EN) · Janani Venugopalan, Gaurav Deshkar, Rishabh Gaur, Harshal Hayatnagarkar, Jayanta Kshirsagar · 2026-06-04 04:00

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

arXiv:2606.04562v1 Announce Type: new Abstract: Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes p…

arXiv cs.AI TIER_1 English(EN) · Saket Reddy, Ke Yang, ChengXiang Zhai · 2026-06-04 04:00

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

arXiv:2606.04807v1 Announce Type: new Abstract: Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-ba…

arXiv cs.LG TIER_1 English(EN) · Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu · 2026-06-04 04:00

Self-Distilled Policy Gradient

arXiv:2606.04036v1 Announce Type: new Abstract: On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instanti…

arXiv cs.LG TIER_1 English(EN) · ChengXiang Zhai · 2026-06-03 12:31

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: D…

arXiv cs.AI TIER_1 English(EN) · Jayanta Kshirsagar · 2026-06-03 07:51

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy ex…

arXiv cs.AI TIER_1 English(EN) · Ali Asadi, Krishnendu Chatterjee, Ehsan Goharshady, Mehrdad Karrabi, Alipasha Montaseri, Carlo Pagano · 2026-06-03 04:00

Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

arXiv:2601.23229v2 Announce Type: replace Abstract: Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realizat…

arXiv cs.AI TIER_1 English(EN) · Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei · 2026-06-03 04:00

Physics-Guided Policy Optimization with Self-Distillation

arXiv:2606.03620v1 Announce Type: cross Abstract: Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each updat…

arXiv cs.AI TIER_1 English(EN) · Frederico Messa, André Grahl Pereira · 2026-06-03 04:00

Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression

arXiv:2403.19883v2 Announce Type: replace Abstract: Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. In this work, we present a collection …

arXiv cs.CL TIER_1 English(EN) · Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu · 2026-06-03 04:00

Hint-Guided Diversified Policy Optimization for LLM Reasoning

arXiv:2606.03021v1 Announce Type: new Abstract: Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechan…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 01:25

Policy Gradient for Continuous-Time Robust Markov Decision Processes

The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient …

arXiv cs.AI TIER_1 English(EN) · Kai Wei · 2026-06-02 13:20

Physics-Guided Policy Optimization with Self-Distillation

Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-…

arXiv cs.AI TIER_1 English(EN) · Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac · 2026-06-02 04:00

Zero-Shot Off-Policy Learning

arXiv:2602.01962v2 Announce Type: replace-cross Abstract: Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value…

arXiv cs.LG TIER_1 English(EN) · Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu · 2026-06-02 04:00

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

arXiv:2605.21125v2 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language m…

arXiv cs.LG TIER_1 English(EN) · Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng · 2026-06-02 04:00

Mean Flow Policy Optimization

arXiv:2604.14698v2 Announce Type: replace Abstract: Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome…

arXiv cs.LG TIER_1 English(EN) · Wyame Benslimane, Tinghan Ye, Pascal Van Hentenryck, Paul Grigas · 2026-06-02 04:00

Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback

arXiv:2606.01081v1 Announce Type: new Abstract: Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and ful…

arXiv cs.CL TIER_1 English(EN) · Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao · 2026-06-02 04:00

Stabilizing Policy Optimization via Logits Convexity

arXiv:2603.00963v2 Announce Type: replace-cross Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we …

arXiv cs.AI TIER_1 English(EN) · Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper · 2026-06-02 04:00

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

arXiv:2603.15956v3 Announce Type: replace-cross Abstract: Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, …

arXiv cs.AI TIER_1 English(EN) · Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang · 2026-06-02 04:00

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

arXiv:2512.00062v2 Announce Type: replace-cross Abstract: Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such …

arXiv cs.AI TIER_1 English(EN) · Bilal Faye, Hanane Azzag, Mustapha Lebbah · 2026-06-02 04:00

Value-Free Policy Optimization via Reward Partitioning

arXiv:2506.13702v4 Announce Type: replace-cross Abstract: Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existin…

arXiv cs.AI TIER_1 English(EN) · Hongqiang Lin, Pengfei Wang, Nenggan Zheng · 2026-06-02 04:00

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

arXiv:2606.00680v1 Announce Type: new Abstract: Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Self-Distilled Policy Gradient

A self-distilled policy-gradient framework combines on-policy self-distillation with verifier advantages and KL regularization to improve reinforcement learning stability and performance.

arXiv cs.LG TIER_1 English(EN) · Qijun Liao, Jue Yang, Yiting Kang, Xinxin Zhao, Yong Zhang, Mingan Zhao · 2026-06-01 04:00

Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization

arXiv:2603.11600v2 Announce Type: replace Abstract: Deep reinforcement learning for continuous control often suffers from high variance, low energy efficiency, and poor generalization under distribution shift, as purely data-driven exploration ignores available physical structure…

arXiv cs.LG TIER_1 English(EN) · Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, H. Jin Kim, Daesol Cho · 2026-06-01 04:00

FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

arXiv:2605.30749v1 Announce Type: new Abstract: Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importa…

arXiv cs.AI TIER_1 English(EN) · Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda · 2026-06-01 04:00

Safe Equilibrium Policy Optimization for Strategic Agent Policies

arXiv:2605.30854v1 Announce Type: cross Abstract: Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions throu…

arXiv cs.CL TIER_1 English(EN) · Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li · 2026-06-01 04:00

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

arXiv:2602.19049v2 Announce Type: replace Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level re…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Amit Dhanda · 2026-05-29 05:20

Safe Equilibrium Policy Optimization for Strategic Agent Policies

Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -…

arXiv cs.AI TIER_1 English(EN) · Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia · 2026-05-29 04:00

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

arXiv:2509.23730v2 Announce Type: replace Abstract: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM r…

arXiv cs.AI TIER_1 English(EN) · Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu · 2026-05-29 04:00

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

arXiv:2605.13230v2 Announce Type: replace-cross Abstract: On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existi…

arXiv cs.AI TIER_1 English(EN) · Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye · 2026-05-29 04:00

ESPO: Early-Stopping Proximal Policy Optimization

arXiv:2605.29860v1 Announce Type: cross Abstract: When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receiv…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

GCPO enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts, improving performance in text-to-image generation and chain-of-thought reasoning tasks.

arXiv cs.LG TIER_1 English(EN) · Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang · 2026-05-28 04:00

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

arXiv:2605.28109v1 Announce Type: new Abstract: Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off,…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

ESPO: Early-Stopping Proximal Policy Optimization

ESPO improves mathematical reasoning in large language models by detecting and terminating failed trajectories early, leading to better performance and reduced computational waste.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 08:01

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-opti…

arXiv cs.AI TIER_1 English(EN) · Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li · 2026-05-27 04:00

Ratio-Variance Regularized Policy Optimization

arXiv:2605.26784v1 Announce Type: cross Abstract: Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that …
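For context on the clipping mechanism the abstract refers to, here is a minimal sketch of the standard PPO-style clipped surrogate for a single sample. It is not the method proposed in this paper. The function names and the value `eps=0.2` are illustrative assumptions. It shows that a sample with positive advantage and a probability ratio above `1 + eps` contributes zero gradient, no matter how large its advantage is.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard clipped surrogate for one sample (illustrative, not the paper's objective)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_grad_wrt_ratio(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative of the clipped surrogate with respect to the probability ratio."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    unclipped_term = ratio * advantage
    clipped_term = clipped_ratio * advantage
    # The gradient flows only when the unclipped term is the active minimum.
    return advantage if unclipped_term <= clipped_term else 0.0


if __name__ == "__main__":
    # High-advantage sample far from the old policy: the update is truncated.
    print(surrogate_grad_wrt_ratio(ratio=1.5, advantage=2.0))
    # Same ratio with negative advantage: the gradient is kept.
    print(surrogate_grad_wrt_ratio(ratio=1.5, advantage=-2.0))
```

The hard cutoff applies to every sample outside the clip range on the favourable side, which is the behaviour the abstract describes as indiscriminate truncation.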

arXiv cs.AI TIER_1 English(EN) · Xianzhou Zeng, Jing Huang, Chunmei Xie, Gongrui Nan, Siye Chen, Mengyu Lu, Weiqi Xiong, Qixuan Zhou, Junhao Zhang, Qiang Zhu, Yadong Li, Xingzhong Xu · 2026-05-27 04:00

UCPO: Uncertainty-Aware Policy Optimization

arXiv:2601.22648v2 Announce Type: replace Abstract: The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL para…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Researchers developed a new metric called IB-Score based on Information Bottleneck theory to evaluate exploration-exploitation balance in online reinforcement learning for large language models, and proposed IB-TPO framework that improves sampling efficiency and performance over …

arXiv cs.AI TIER_1 English(EN) · Aysin Tumay, Jiahe Huang, Elise Jortberg, Rose Yu · 2026-05-26 04:00

Generative OOD-regularized Model-based Policy Optimization

arXiv:2605.24405v1 Announce Type: cross Abstract: We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure s…

arXiv stat.ML TIER_1 English(EN) · Christian Walder, Deep Karkhanis · 2026-06-11 04:00

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

arXiv:2505.15201v5 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the exp…
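As background for the pass@1 versus pass@k distinction in the abstract, the sketch below computes the widely used unbiased pass@k estimate from `n` sampled attempts of which `c` are correct. It illustrates the metric only and is not the paper's optimization method. The function name `pass_at_k` is an assumption made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the chance that at least one of k attempts,
    drawn without replacement from n sampled attempts (c correct), is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct attempt out of four: pass@1 is low, pass@4 is certain.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=4))
```

With a single correct attempt among four, pass@1 is 0.25 while pass@4 is 1.0, which shows why rewarding samples independently and rewarding a set of samples jointly can favour different policies.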

arXiv cs.CV TIER_1 English(EN) · Jiexi Lyu, Xizhou Bu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li · 2026-06-10 04:00

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

arXiv:2606.10517v1 Announce Type: new Abstract: Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. Howeve…

arXiv cs.CV TIER_1 English(EN) · Wei Li · 2026-06-09 07:49

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavio…

arXiv stat.ML TIER_1 English(EN) · Ousmane Amadou Dia · 2026-06-09 04:00

Variational Proximal Policy Optimization

arXiv:2606.08032v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization (…

arXiv cs.CV TIER_1 English(EN) · Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu · 2026-06-09 04:00

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

arXiv:2606.08708v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level…

arXiv stat.ML TIER_1 English(EN) · Ousmane Amadou Dia · 2026-06-06 07:50

Variational Proximal Policy Optimization

Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization ($\textsc{VP}_2\textsc{O}$), a particle-based v…

arXiv stat.ML TIER_1 English(EN) · Chunrong Ai, Zeqi Wu, Zheng Zhang · 2026-06-02 04:00

Data-Automated Policy Learning for Nonlinear Welfare

arXiv:2606.01659v1 Announce Type: cross Abstract: This paper explores policy learning from observational data, focusing on a nonlinear welfare criterion in a binary treatment setting. The nonlinear criterion is inspired by scenarios where policymakers prioritize specific populati…

arXiv stat.ML TIER_1 English(EN) · Zheng Zhang · 2026-06-01 04:13

Data-Automated Policy Learning for Nonlinear Welfare

This paper explores policy learning from observational data, focusing on a nonlinear welfare criterion in a binary treatment setting. The nonlinear criterion is inspired by scenarios where policymakers prioritize specific population segments. We model this criterion using a utili…

arXiv stat.ML TIER_1 English(EN) · Caio de Prospero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas · 2026-05-29 04:00

Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

arXiv:2509.08194v2 Announce Type: replace-cross Abstract: We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings…

arXiv cs.CV TIER_1 English(EN) · Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover · 2026-05-29 04:00

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

arXiv:2605.29198v1 Announce Type: new Abstract: Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on samp…
