New TRQAM Algorithm Stabilizes Offline Reinforcement Learning

By the PulseAugur editorial team · [2 sources] · 2026-05-26 14:28

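A new paper introduces Trust Region Q-Adjoint Matching (TRQAM), an algorithm designed to stabilize offline reinforcement learning of pretrained flow policies. TRQAM adaptively controls the path-space KL divergence, which addresses the instability and model collapse inherent in the earlier Q-learning with Adjoint Matching (QAM) method. In experiments on 50 OGBench tasks, TRQAM significantly outperforms existing methods, reaching a 68% success rate in offline RL compared with 46% for the baseline.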
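To make the idea of "adaptively controlling a path-space KL divergence" concrete, here is a minimal sketch and not the paper's algorithm. It assumes two diffusion processes that share the same noise scale `sigma`, in which case the standard Girsanov identity gives the path-space KL as the expected time integral of the squared drift gap divided by `2 * sigma**2`. The function names `path_kl_estimate` and `adapt_penalty`, the parameter `kl_target`, and the doubling/halving rule (borrowed from PPO's adaptive KL penalty) are all illustrative assumptions; the summary does not say how TRQAM actually performs the adaptation.

```python
import numpy as np

def path_kl_estimate(drift_new, drift_ref, sigma, x0, n_steps=50, rng=None):
    """Monte Carlo estimate of KL(P_new || P_ref) between two SDEs on t in [0, 1].

    Assumes both processes share the scalar diffusion coefficient `sigma`,
    so the KL is E_new[ integral of |u - b|^2 / (2 sigma^2) dt ] (Girsanov).
    Hypothetical helper for illustration; not taken from the TRQAM paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)            # shape (batch, dim)
    kl = np.zeros(x.shape[0])
    for k in range(n_steps):
        t = k * dt
        u = drift_new(x, t)                   # drift of the fine-tuned process
        b = drift_ref(x, t)                   # drift of the pretrained reference
        kl += np.sum((u - b) ** 2, axis=1) / (2.0 * sigma ** 2) * dt
        # Euler-Maruyama step under the NEW process (samples come from P_new).
        x = x + u * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return float(kl.mean())

def adapt_penalty(beta, kl, kl_target, tol=1.5, factor=2.0):
    """PPO-style adaptive rule (an assumption, not TRQAM's rule):
    strengthen the KL penalty when the measured KL overshoots the target,
    relax it when the KL is well below the target."""
    if kl > kl_target * tol:
        return beta * factor
    if kl < kl_target / tol:
        return beta / factor
    return beta

if __name__ == "__main__":
    x0 = np.zeros((256, 1))
    kl = path_kl_estimate(
        drift_new=lambda x, t: np.ones_like(x),   # constant drift gap of 1
        drift_ref=lambda x, t: np.zeros_like(x),
        sigma=1.0,
        x0=x0,
    )
    beta = adapt_penalty(beta=1.0, kl=kl, kl_target=0.1)
    print(kl, beta)  # kl is approximately 0.5, so the penalty doubles to 2.0
```

In this toy setup the drift gap is constant, so the estimate does not depend on the sampled paths. With learned drifts, the estimate would vary from batch to batch, and a rule like `adapt_penalty` would keep the fine-tuned process within a chosen distance of the pretrained one.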

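Impact: TRQAM offers a more stable approach to offline reinforcement learning, which could improve performance on complex tasks and make fine-tuning of pretrained models more reliable.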

Ranking rationale: This cluster contains a research paper detailing a new reinforcement learning algorithm.


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin · 2026-05-27 04:00

Trust Region Q Adjoint Matching

arXiv:2605.27079v1 Announce Type: cross Abstract: Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this…

arXiv cs.AI TIER_1 English(EN) · Jinwoo Shin · 2026-05-26 14:28

Trust Region Q Adjoint Matching

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochast…
