PulseAugur

New TRQAM Algorithm Stabilizes Off-Policy Reinforcement Learning

A new paper introduces Trust Region Q-Adjoint Matching (TRQAM), an algorithm designed to stabilize off-policy reinforcement learning for pretrained flow policies. TRQAM targets the instability and model collapse seen in earlier Q-learning with Adjoint Matching (QAM) methods by adaptively controlling the path-space KL divergence. Experiments on 50 OGBench tasks show TRQAM reaching a 68% success rate in offline RL, compared with 46% for the baseline.
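
The summary does not say how TRQAM adapts its KL constraint. As a rough illustration of what adaptive KL control can look like in general, the sketch below uses the well-known PPO-style adaptive penalty rule. It is a stand-in technique, not the paper's update. The function name, the `tolerance` and `factor` defaults, and the sample KL values are all hypothetical.

```python
def update_kl_coefficient(beta, observed_kl, target_kl, tolerance=1.5, factor=2.0):
    """PPO-style adaptive KL penalty rule (illustrative only, not TRQAM's update).

    beta        -- current weight on the KL penalty term
    observed_kl -- KL divergence measured after the latest policy update
    target_kl   -- desired KL level (the trust-region size)
    """
    if observed_kl > tolerance * target_kl:
        # The policy moved too far from the reference: penalize KL more strongly.
        return beta * factor
    if observed_kl < target_kl / tolerance:
        # The policy barely moved: relax the penalty to allow larger steps.
        return beta / factor
    # KL is within the tolerated band around the target: leave beta unchanged.
    return beta


# Hypothetical sequence of measured KL values against a target of 0.01.
beta = 1.0
for kl in [0.05, 0.04, 0.005, 0.01]:
    beta = update_kl_coefficient(beta, kl, target_kl=0.01)
```

The idea is that the penalty weight rises when updates overshoot the target divergence and falls when they undershoot, so the effective step size tracks the trust region without a hand-tuned fixed coefficient.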

IMPACT TRQAM offers a more stable approach to off-policy reinforcement learning, potentially improving performance on complex tasks and enabling more reliable fine-tuning of pretrained models.

RANK_REASON The cluster contains a research paper detailing a new algorithm for reinforcement learning.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin ·

    Trust Region Q Adjoint Matching

    arXiv:2605.27079v1 (announce type: cross). Abstract: Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this…

  2. arXiv cs.AI TIER_1 English(EN) · Jinwoo Shin ·

    Trust Region Q Adjoint Matching

    Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochast…