New TRQAM Algorithm Stabilizes Off-Policy Reinforcement Learning

By PulseAugur Editorial · [2 sources] · 2026-05-26 14:28

A new paper introduces Trust Region Q-Adjoint Matching (TRQAM), an algorithm designed to stabilize off-policy reinforcement learning for pretrained flow policies. TRQAM targets the instability and model collapse seen in the earlier Q-learning with Adjoint Matching (QAM) method by adaptively controlling the path-space KL divergence. In experiments on 50 OGBench tasks, TRQAM outperforms existing methods, reaching a 68% success rate in offline RL compared with 46% for a baseline, a gain of 22 percentage points.
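To make "adaptively controlling" a KL divergence concrete, here is a minimal sketch of a generic adaptive KL-penalty rule of the kind used in PPO-style trust-region methods. It is not taken from the TRQAM paper: the function name, the `target_kl` parameter, and the doubling/halving constants are illustrative assumptions, and the paper's actual path-space rule may differ.

```python
def update_kl_coefficient(beta, observed_kl, target_kl, factor=2.0, tolerance=1.5):
    """Adjust a KL-penalty weight so the observed KL tracks a target.

    Generic PPO-style heuristic, shown for illustration only.
    It is NOT the update rule from the TRQAM paper.
    """
    if observed_kl > target_kl * tolerance:
        # The policy moved too far from the reference: penalize KL more.
        return beta * factor
    if observed_kl < target_kl / tolerance:
        # The policy barely moved: relax the penalty to allow larger updates.
        return beta / factor
    # Observed KL is within the tolerance band: leave the weight unchanged.
    return beta


if __name__ == "__main__":
    beta = 1.0
    for kl in (0.10, 0.05, 0.01, 0.001):  # hypothetical per-update KL estimates
        beta = update_kl_coefficient(beta, kl, target_kl=0.01)
        print(f"observed_kl={kl:<6} -> beta={beta}")
```

The idea behind this style of rule is that the penalty weight rises when updates overshoot the trust region and falls when they are overly conservative, which is one common way to keep a fine-tuned policy close to its pretrained starting point.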

IMPACT TRQAM offers a more stable approach to off-policy reinforcement learning, potentially improving performance on complex tasks and enabling more reliable fine-tuning of pretrained models.

RANK_REASON The cluster contains a research paper detailing a new algorithm for reinforcement learning.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin · 2026-05-27 04:00

Trust Region Q Adjoint Matching

arXiv:2605.27079v1 · Announce Type: cross · Abstract: Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this…

arXiv cs.AI TIER_1 English(EN) · Jinwoo Shin · 2026-05-26 14:28

Trust Region Q Adjoint Matching

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochast…
