Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
Two new methods build reinforcement learning policies on one-step MeanFlow generation, aiming to improve both efficiency and expressiveness. The first, Score-Based One-step MeanFlow Policy Optimization (SOM), constructs a target velocity field from Q-function scores and a probability flow ODE, and is reported to reach state-of-the-art performance in online RL with reduced training and inference times. The second, Stochastic MeanFlow Policies (SMFP), is a one-step generative policy class that maps noise to actions through a MeanFlow transformation and provides a unified objective for stable, exploratory policy improvement in off-policy settings.
AI IMPACT: These policy optimization techniques promise faster training and inference in reinforcement learning, which could accelerate progress in robotics and autonomous systems.
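To make the one-step noise-to-action mapping mentioned above concrete, here is a minimal sketch in Python. It is not the SMFP or SOM implementation. It assumes the standard MeanFlow convention, where a sample is recovered from noise in a single step as a = eps - u(eps, r=0, t=1). The function `toy_mean_velocity`, the constant `SIGMA`, and the state-dependent mean `tanh(state)` are hypothetical stand-ins for a learned, state-conditioned average-velocity network. The target action distribution is assumed to be a 1-D Gaussian so that the average velocity has a closed form.

```python
import math
import random

SIGMA = 0.3  # hypothetical standard deviation of the toy target action distribution


def toy_mean_velocity(z, r, t, state):
    """Stand-in for a learned average-velocity network u(z, r, t | state).

    Assumption: target actions follow N(mu(state), SIGMA^2) and the path is
    z_tau = (1 - tau) * action + tau * noise, with noise ~ N(0, 1). The
    marginals are then Gaussian, the flow between two times is affine, and
    the average velocity over [r, t] is (z_t - z_r) / (t - r).
    """
    if not 0.0 <= r < t <= 1.0:
        raise ValueError("expected 0 <= r < t <= 1")
    mu = math.tanh(state)  # hypothetical state-dependent mean action

    def mean(tau):
        return (1.0 - tau) * mu

    def std(tau):
        return math.sqrt(((1.0 - tau) * SIGMA) ** 2 + tau ** 2)

    # Move z from time t back to time r along the affine flow.
    z_r = mean(r) + std(r) / std(t) * (z - mean(t))
    return (z - z_r) / (t - r)


def sample_action(state, rng):
    """One-step sampling: draw noise, subtract the average velocity over [0, 1]."""
    eps = rng.gauss(0.0, 1.0)
    return eps - toy_mean_velocity(eps, 0.0, 1.0, state)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_action(0.5, rng), 3) for _ in range(5)])
```

Each call to `sample_action` needs a single evaluation of the velocity function, which is where the inference-time saving over multi-step flow or diffusion policies would come from. In this toy setting the sampled action reduces to `tanh(state) + SIGMA * eps`, so the policy stays stochastic even though generation takes one step.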