Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator
Researchers have analyzed the noise-to-signal ratio (NSR) in REINFORCE policy-gradient estimators, a key component in reinforcement learning. They found that the NSR can increase significantly as a policy approaches an optimal state, sometimes leading to training instability and policy collapse. The study provides methods to characterize this NSR for specific system types and derives a general upper bound for variance in more complex scenarios. AI
IMPACT Provides a deeper theoretical understanding of training dynamics in reinforcement learning, potentially leading to more stable and efficient algorithms.