PulseAugur
English(EN) Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

New reinforcement learning policies improve efficiency through one-step generative control

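Researchers have developed new reinforcement learning (RL) policy methods aimed at improving efficiency and expressiveness. One method, Score-Based One-step MeanFlow Policy Optimization (SOM), constructs a target velocity field from the score of the Q-function and the probability flow ODE, and achieves state-of-the-art performance in online RL while reducing training and inference time. The other, Stochastic MeanFlow Policies (SMFP), offers a one-step generative policy class that maps noise to actions through a MeanFlow transformation, and provides a unified objective for stable, exploratory policy improvement in the off-policy setting.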
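To make "one-step generation" concrete, here is a minimal sketch of the MeanFlow sampling rule z_r = z_t - (t - r) · u(z_t, r, t), which turns noise at t = 1 into an action at r = 0 with a single evaluation of the average velocity u. The toy setup is an assumption made for illustration and is not taken from either paper: a single hypothetical target action `A_STAR` and a linear noise-to-action path, for which both the instantaneous and the average velocity have closed forms. In practice u would be a learned network conditioned on the state.

```python
import numpy as np

# Hypothetical toy target: one fixed action, so velocities are available in closed form.
A_STAR = np.array([0.5, -0.2])

def instantaneous_velocity(z, t):
    """Velocity v(z_t, t) of the linear path z_t = (1 - t) * A_STAR + t * eps."""
    return (z - A_STAR) / t

def average_velocity(z, r, t):
    """Average velocity u(z_t, r, t) = (z_t - z_r) / (t - r) on the same path.

    The path is a straight line, so the average over [r, t] equals the
    instantaneous velocity and does not depend on r.
    """
    return (z - A_STAR) / t

def one_step_action(noise):
    """MeanFlow-style sampling: a single jump from t = 1 to r = 0."""
    return noise - (1.0 - 0.0) * average_velocity(noise, 0.0, 1.0)

def multi_step_action(noise, n_steps):
    """Baseline: Euler integration of the flow ODE from t = 1 down to t = 0."""
    z = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        z = z - dt * instantaneous_velocity(z, t)
    return z

rng = np.random.default_rng(0)
noise = rng.standard_normal(2)
print(one_step_action(noise))        # one function evaluation
print(multi_step_action(noise, 10))  # ten function evaluations
```

In this toy case both samplers land on the same action, but the one-step version needs one function evaluation instead of ten. That difference in inference cost is the motivation both abstracts cite.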

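Impact: These new policy optimization techniques promise to speed up RL training and inference, which could accelerate progress in robotics and autonomous systems.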

Ranking rationale: This cluster contains two academic papers detailing new reinforcement learning methods.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 6 sources.

Sources [6]

  1. arXiv cs.AI TIER_1 English(EN) · Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee ·

    Score-Based One-step MeanFlow Policy Optimization

    arXiv:2605.23365v1. Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly proble…

  2. arXiv cs.AI TIER_1 English(EN) · Byung-Jun Lee ·

    Score-Based One-step MeanFlow Policy Optimization

    Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising al…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Score-Based One-step MeanFlow Policy Optimization

    Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising al…

  4. arXiv cs.AI TIER_1 English(EN) · Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu ·

    Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

    arXiv:2605.21282v2. Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Genera…

  5. arXiv cs.AI TIER_1 English(EN) · Yanwei Fu ·

    Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

    Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often requi…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

    Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often requi…