QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

New QPILOTS method improves reinforcement learning for diffusion policies

By the PulseAugur editorial team · [1 source] · 2026-06-16 04:00

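Researchers have introduced QPILOTS, a new method intended to make reinforcement learning (RL) more efficient for flow-matching and diffusion policies. The technique steers the denoising process at inference time by projecting intermediate actions onto an estimate of the final clean action, which avoids the numerical instability associated with backpropagating gradients directly. QPILOTS comes in two variants, QPILOTS-U and QPILOTS-M, and showed superior performance on offline-to-online RL benchmarks, reaching a 90% success rate across 50 tasks. The method was also applied successfully to a large pretrained vision-language-action (VLA) foundation model, where it outperformed existing inference-time methods.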
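The steering idea in the summary above can be illustrated with a toy sketch. It is not the QPILOTS algorithm, whose exact update rule this summary does not give. The sketch assumes a linear flow path, a hand-written velocity field standing in for a pretrained policy, and a quadratic critic; `velocity`, `q_value`, `q_grad`, `sample_action`, `BASE_ACTION`, `BEST_ACTION`, and `step_size` are all hypothetical names. At each denoising step it forms an estimate of the clean action, nudges that estimate along the critic's action gradient, and moves the sample toward the nudged estimate, so the critic gradient is never propagated through the denoising chain.

```python
import numpy as np

# Hypothetical stand-ins: the action the toy policy produces on its own,
# and the action the toy critic prefers.
BASE_ACTION = np.array([0.0, 0.0])
BEST_ACTION = np.array([1.0, 1.0])


def velocity(x, t):
    """Toy velocity field for the linear path x_t = (1 - t) * noise + t * action."""
    return (BASE_ACTION - x) / (1.0 - t)


def q_value(a):
    """Toy critic: higher when the action is closer to BEST_ACTION."""
    return -np.sum((a - BEST_ACTION) ** 2)


def q_grad(a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - BEST_ACTION)


def sample_action(noise, n_steps=10, step_size=0.0):
    """Euler sampling; step_size > 0 steers the clean-action estimate (assumed rule)."""
    x = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        # Estimate of the final clean action from the current noisy sample.
        a_hat = x + (1.0 - t) * velocity(x, t)
        # Nudge the estimate uphill on the critic; no gradient flows through velocity().
        a_hat = a_hat + step_size * q_grad(a_hat)
        # Move the sample toward the (possibly nudged) estimate.
        x = x + dt * (a_hat - x) / (1.0 - t)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal(2)
plain = sample_action(noise, step_size=0.0)
steered = sample_action(noise, step_size=0.25)
print(q_value(plain), q_value(steered))
```

In this toy setting the unsteered sample lands on the policy's own action, while the steered sample ends closer to the action the critic prefers. A real flow policy would replace `velocity` with a learned network and `q_grad` with the gradient of a learned critic.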

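Impact: Improves the efficiency of reinforcement learning for complex policy generation, which could benefit robotics and autonomous systems.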

Ranking rationale: This cluster contains an academic paper detailing a new reinforcement learning method.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart · 2026-06-16 04:00

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

arXiv:2606.14801v1 Announce Type: cross Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action…

