(AF) JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

JURY-RL framework strengthens LLM reasoning with label-free verifiable rewards

By PulseAugur Editorial · [2 sources] · 2026-04-28 09:29

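Researchers have developed JURY-RL, a label-free framework for reinforcement learning with verifiable rewards (RLVR) that aims to improve the reasoning of large language models. The method separates proposing an answer, which is done through model rollouts, from verifying it, and uses a formal verifier to decide whether a rollout is eligible for reward. When verification is inconclusive, a fallback mechanism called ResZero is used to keep training stable. Compared with existing label-free methods, JURY-RL performs better on mathematical reasoning tasks and transfers competitively to code generation and general benchmarks.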
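The propose-then-verify split described above can be illustrated with a minimal Python sketch. It assumes that the proposal is the majority answer among the rollouts (as the title's "votes propose" suggests) and that the verifier is a callable returning `True` only when it can confirm the proposal. The names `majority_proposal`, `placeholder_fallback`, and `jury_rewards` are hypothetical, and the fallback shown is a neutral stand-in that returns all zeros; it is not the ResZero formula, which the summary does not specify.

```python
from collections import Counter
from typing import Callable, List, Optional


def majority_proposal(answers: List[str]) -> str:
    """Return the most frequent final answer among the rollouts (the 'vote')."""
    return Counter(answers).most_common(1)[0][0]


def placeholder_fallback(answers: List[str], proposal: str) -> List[float]:
    """Hypothetical stand-in for the fallback step.

    This is NOT ResZero: it simply gives every rollout a zero reward so that
    an unverified proposal is never reinforced.
    """
    return [0.0] * len(answers)


def jury_rewards(
    answers: List[str],
    verify: Callable[[str], Optional[bool]],
    fallback: Callable[[List[str], str], List[float]] = placeholder_fallback,
) -> List[float]:
    """Assign one reward per rollout without using any ground-truth label.

    Assumption: any verdict other than True is treated as inconclusive.
    """
    proposal = majority_proposal(answers)
    if verify(proposal) is True:
        # Verified: reward only the rollouts that agree with the proposal.
        return [1.0 if a == proposal else 0.0 for a in answers]
    # Inconclusive: hand the decision to the fallback.
    return fallback(answers, proposal)


if __name__ == "__main__":
    rollouts = ["4", "4", "5", "4"]
    print(jury_rewards(rollouts, lambda a: a == "4"))
    print(jury_rewards(rollouts, lambda a: None))
```

In this sketch the vote only nominates a candidate; whether any rollout is rewarded depends entirely on the verifier's verdict, which is the separation the summary describes.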

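Impact: Introduces a new approach to strengthening LLM reasoning in verifiable domains, with the potential to reduce reliance on human annotation.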

Ranking rationale: This cluster describes a research paper detailing a novel framework for improving LLM reasoning.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 (AF) · Minpeng Liao · 2026-04-28 09:29

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as major…

Hugging Face Daily Papers TIER_1 (AF) · 2026-04-28 09:29
