Beyond the Sampled Token: Preserving Candidate Support in RLVR

New RLVR method addresses exploration collapse

By PulseAugur Editorial · [1 source] · 2026-06-17 04:00

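Researchers have identified a key problem in reinforcement learning with verifiable rewards (RLVR) called exploration collapse: probability mass concentrates on the top-ranked response, which limits the diversity of outcomes. To address it, they propose a new method called Candidate-aware Support Preservation (CaSP). CaSP adjusts the gradient for correct responses and penalizes incorrect top-ranked responses, and it improves performance across a range of benchmarks and model sizes.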
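The concentration described above can be illustrated with a small Python sketch. It is not the CaSP method, whose details are not given in this summary. It only shows how sharpening a next-token distribution shrinks both its entropy and the number of candidates that keep non-negligible probability. The toy logits and the 1% threshold for counting a token as a live candidate are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def support_size(probs, threshold=0.01):
    """Count candidates whose probability is at least `threshold`.

    The 1% default is an illustrative assumption, not a value from the paper.
    """
    return sum(1 for p in probs if p >= threshold)


# Hypothetical logits for five candidate next tokens.
logits = [2.0, 1.5, 1.0, 0.5, 0.0]

# Scaling the logits up mimics a policy that grows more confident in its
# top-ranked candidate as training proceeds.
for scale in (1.0, 2.0, 4.0, 8.0):
    probs = softmax([scale * x for x in logits])
    print(
        f"scale={scale:>4}: top-1 prob={max(probs):.3f}, "
        f"entropy={entropy(probs):.3f}, support={support_size(probs)}"
    )
```

As the scale grows, the top-1 probability approaches 1 while entropy and the support count both fall. That is the kind of loss of candidate support the summary describes as exploration collapse.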

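Impact: This research introduces a new way to improve exploration in RLVR, which could lead to more diverse and effective AI responses on complex tasks.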

Ranking rationale: This cluster contains an academic paper detailing a new method in a specific area of AI research.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · English · Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen · 2026-06-17 04:00

Beyond the Sampled Token: Preserving Candidate Support in RLVR

arXiv:2510.14807v3. Abstract: We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR), from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on …

