PulseAugur
English(EN) Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

Q-MMR framework offers a new approach to off-policy evaluation

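Researchers have introduced Q-MMR, a new theoretical framework for off-policy evaluation in finite-horizon Markov decision processes (MDPs). The method learns one scalar weight per data point, using a moment-matching objective, so that the reweighted rewards approximate the expected return under the target policy. A key finding is a data-dependent, dimension-free finite-sample guarantee under general function approximation, which notably does not depend on the complexity of the function class.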
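
As a rough illustration of the reweighting idea only, the sketch below handles a single-step (horizon 1) problem with a linear function class: it picks per-sample weights whose weighted feature sum matches the average features under the target policy, then estimates the return as the weighted sum of rewards. This is not the recursive Q-MMR procedure from the paper; the feature map `features`, the helper `moment_matching_weights`, the synthetic data, and the threshold target policy are all assumptions made for the example.

```python
import numpy as np


def features(s, a):
    # Hypothetical feature map phi(s, a) defining a linear function class.
    return np.stack([np.ones_like(s), s, a, s * a], axis=1)


def moment_matching_weights(phi_data, phi_target):
    # Target moments: average feature vector when actions follow the target policy.
    m = phi_target.mean(axis=0)
    # Minimum-norm weights w satisfying phi_data.T @ w = m (one weight per sample).
    w, *_ = np.linalg.lstsq(phi_data.T, m, rcond=None)
    return w


# Synthetic logged data (assumption): uniform-random behavior policy over {0, 1}.
rng = np.random.default_rng(0)
n = 200
s = rng.normal(size=n)
a = rng.integers(0, 2, size=n).astype(float)

# Noiseless reward that is linear in the features, so the class is realizable.
theta = np.array([0.5, 1.0, -2.0, 3.0])
r = features(s, a) @ theta

# Hypothetical deterministic target policy: take action 1 when s > 0.
pi_a = (s > 0).astype(float)

w = moment_matching_weights(features(s, a), features(s, pi_a))
estimate = float(w @ r)  # reweighted rewards
reference = float((features(s, pi_a) @ theta).mean())  # value on the same states
```

With noiseless linear rewards the weighted estimate coincides with the reference value up to floating-point error; with noisy rewards or a misspecified function class it would only approximate it.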

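Impact: Introduces a novel theoretical framework for off-policy evaluation, which may improve the training of reinforcement learning agents.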

Ranking rationale: This cluster contains an academic paper detailing a new theoretical framework for a machine learning problem.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.LG · TIER_1 · English (EN) · Xiang Li, Nan Jiang

    Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

    arXiv:2605.06474v1. We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under…

  2. Hugging Face Daily Papers · TIER_1 · English (EN)

    Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

    We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned indu…

  3. arXiv stat.ML · TIER_1 · English (EN) · Nan Jiang

    Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

    We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned indu…

  4. arXiv stat.ML · TIER_1 · English (EN) · Nan Jiang

    Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

    We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned indu…