Bridging the Gap Between Average and Discounted TD Learning

New algorithm bridges the theory of average-reward and discounted-reward TD learning

By the PulseAugur editorial team · [1 source] · 2026-05-05 04:00

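Researchers have developed a new algorithm for policy evaluation in the average-reward setting, addressing a theoretical challenge: standard analyses are complicated because the Bellman operator is not a contraction. The approach samples from two Markov trajectories to guarantee convergence to the solution of the projected Bellman equation. The convergence analysis covers both linear function approximation and the tabular setting, contains no dimension-dependent terms, and improves the sample complexity from quartic to quadratic scaling, matching the efficiency of the discounted setting.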
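For readers unfamiliar with the setting, the following is a minimal sketch of the standard tabular average-reward TD(0) update, which is the kind of baseline the summary refers to. It is not the two-trajectory algorithm from the paper, whose details are not given here. The two-state chain, the rewards, the step-size schedule, and the function name `average_reward_td0` are all illustrative assumptions. Because the average-reward Bellman equation is unchanged when a constant is added to every value, the sketch reports only the difference between the two state values.

```python
import random


def average_reward_td0(P, r, num_steps=200_000, seed=0):
    """Standard tabular average-reward TD(0) on a Markov reward process.

    Illustrative baseline only (NOT the paper's two-trajectory method).
    P: row-stochastic transition matrix as a list of lists.
    r: reward received in each state.
    Returns (r_bar, v): the average-reward estimate and differential values.
    """
    rng = random.Random(seed)
    n = len(r)
    v = [0.0] * n   # differential value estimates, defined up to a constant
    r_bar = 0.0     # running estimate of the long-run average reward
    s = 0
    for t in range(num_steps):
        # Sample the next state from the current row of P.
        s_next = rng.choices(range(n), weights=P[s])[0]
        # Assumed decaying step size for the value update.
        alpha = 100.0 / (1000.0 + t)
        # Running sample mean of observed rewards.
        r_bar += (r[s] - r_bar) / (t + 1)
        # Average-reward TD error: no discount factor, reward is centered.
        delta = r[s] - r_bar + v[s_next] - v[s]
        v[s] += alpha * delta
        s = s_next
    return r_bar, v


# Hypothetical two-state chain used only for illustration.
P = [[0.9, 0.1], [0.5, 0.5]]
r = [1.0, 0.0]
r_bar, v = average_reward_td0(P, r)
print(f"average reward estimate: {r_bar:.3f}")
print(f"v[0] - v[1] estimate:    {v[0] - v[1]:.3f}")
```

For this particular chain the stationary distribution is (5/6, 1/6), so the exact average reward is 5/6 and the exact differential value gap is 5/3; the estimates should land close to those values.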

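Impact: Introduces a more efficient theoretical framework for reinforcement learning algorithms, with the potential to improve performance in complex environments.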

Ranking rationale: The cluster contains an academic paper detailing a new algorithm for temporal difference learning.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG · Haoxing Tian, Zaiwei Chen, Ioannis Ch. Paschalidis, Alex Olshevsky · 2026-05-05 04:00

Bridging the Gap Between Average and Discounted TD Learning

arXiv:2605.02103v1 Announce Type: new Abstract: The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of s…
