New SFT method aligns reinforcement learning with Boltzmann projection

By PulseAugur Editorial · [2 sources] · 2026-05-04 11:10

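Researchers have developed a new method called Reference-Sampled Boltzmann Projection (BOLT) to improve reinforcement learning with verifiable rewards (RLVR). The technique aims to decouple rollout generation from the optimization process by running static supervised fine-tuning (SFT) on precomputed data. The BOLT procedure establishes a target-matched weighted SFT objective, which is shown to be equivalent to the KL-regularized RLVR optimizer.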
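To make the idea of a weighted SFT objective concrete, here is a minimal sketch in Python. It is not the paper's BOLT objective, whose exact form is not given in the summary above. It assumes the textbook closed form for KL-regularized objectives, in which the optimal policy is proportional to the reference policy times exp(reward / beta), so samples drawn from the reference policy can be reweighted by a self-normalized Boltzmann factor. The names `boltzmann_weights`, `weighted_sft_loss`, and `beta`, along with all numbers, are hypothetical and chosen only for illustration.

```python
import math


def boltzmann_weights(rewards, beta):
    """Self-normalized weights w_i proportional to exp(r_i / beta).

    Assumption: the responses were sampled from the reference policy,
    so these weights tilt the sample toward the KL-regularized optimum.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(logprobs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i | x)."""
    if len(logprobs) != len(weights):
        raise ValueError("logprobs and weights must have the same length")
    return -sum(w * lp for w, lp in zip(weights, logprobs))


# Toy data (hypothetical): four precomputed responses to one prompt,
# scored by a binary verifier, plus the current model's log-probabilities.
rewards = [1.0, 0.0, 0.0, 1.0]
logprobs = [-2.0, -1.0, -3.0, -2.5]

weights = boltzmann_weights(rewards, beta=0.5)
loss = weighted_sft_loss(logprobs, weights)
print(weights, loss)
```

In this sketch the rewards and weights are computed once from stored samples, so no rollout generation or verifier call sits inside the training loop. A smaller `beta` concentrates more weight on the verified-correct responses, while a larger `beta` keeps the weights closer to uniform.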

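Impact: Introduces a novel technique that could make training reinforcement learning models more efficient, potentially reducing computational bottlenecks.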

Ranking rationale: This is an academic paper detailing a new reinforcement learning method.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Yao Shu, Chenxing Wei, Hongbin Lin, Shuang Qiu, Hui Xiong · 2026-05-05 04:00

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

arXiv:2605.02469v1 · Abstract: Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Sta…
arXiv cs.AI TIER_1 English(EN) · Hui Xiong · 2026-05-04 11:10

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on pre…