On the Variance of Temporal Difference Learning and its Reduction Using Control Variates
A new research paper analyzes the variance of temporal difference (TD) learning, a method used in reinforcement learning. The study shows that TD learning reduces variance by aggregating information across multiple trajectories, and that shorter update horizons yield lower variance for a given number of samples. The paper also presents Direct Advantage Estimation (DAE) as a regression-adjusted control variate that offers a tighter variance bound than TD when many samples are available.
AI IMPACT: This research could lead to more stable and efficient reinforcement learning agents by improving variance reduction techniques.
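
For readers unfamiliar with the term, the general idea of a regression-adjusted control variate can be sketched on a toy Monte Carlo problem. The example below is not DAE and is not taken from the paper; the target quantity (the mean of exp(X) for X uniform on [0, 1]), the control variate (X minus its known mean), and the function names are all assumptions chosen for illustration. It subtracts a zero-mean quantity correlated with the target, with the coefficient fitted by least squares on the same samples, and compares the spread of the two estimators over repeated trials.

```python
import numpy as np


def plain_and_adjusted_estimates(n_samples, rng):
    """Return (plain mean, regression-adjusted mean) for E[exp(X)], X ~ U(0, 1).

    Hypothetical toy setup, not the paper's method.
    """
    x = rng.uniform(0.0, 1.0, size=n_samples)
    y = np.exp(x)   # quantity whose mean we want (true value: e - 1)
    c = x - 0.5     # control variate with known mean zero
    # Least-squares slope of y on c: the "regression adjustment".
    beta = np.cov(y, c)[0, 1] / np.var(c, ddof=1)
    return y.mean(), (y - beta * c).mean()


def compare_variances(n_trials=2000, n_samples=50, seed=0):
    """Empirical variance of each estimator across independent trials."""
    rng = np.random.default_rng(seed)
    estimates = np.array(
        [plain_and_adjusted_estimates(n_samples, rng) for _ in range(n_trials)]
    )
    # Column 0: plain estimator, column 1: regression-adjusted estimator.
    return estimates.var(axis=0)


if __name__ == "__main__":
    plain_var, adjusted_var = compare_variances()
    print(f"plain variance:    {plain_var:.6f}")
    print(f"adjusted variance: {adjusted_var:.6f}")
```

Because the coefficient is fitted on the same samples it adjusts, the adjusted estimator carries a small bias that shrinks as the sample size grows, which is one reason such adjustments are usually discussed in the many-samples setting.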