English(EN) Unsupervised Process Reward Models

New VRPRM model uses visual cues to enhance LLM reasoning

By the PulseAugur editorial team · [2 sources] · 2026-05-11 00:00

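Researchers have developed VRPRM, a novel process reward model that uses visual reasoning to improve fine-grained evaluation of the reasoning steps of large language models (LLMs). The approach significantly reduces the data annotation cost that training such models typically requires. Compared with conventional non-thinking PRMs, VRPRM performs better, achieving substantial improvements with only a small fraction of the training data.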

Impact: This research offers a more efficient approach to LLM training, with the potential to lower costs and improve reasoning ability.

Ranking rationale: This cluster contains academic papers introducing new models and training strategies for LLMs.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Xinquan Chen, Chongying Yue, Bangwei Liu, Xuhong Wang, Yingchun Wang, Chaochao Lu · 2026-05-22 04:00

VRPRM: Process Reward Modeling via Visual Reasoning

arXiv:2508.03556v3 Announce Type: replace Abstract: Process Reward Model (PRM) is widely used in the post-training of Large Language Model (LLM) because it can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-11 00:00

Unsupervised Process Reward Models

Unsupervised reward models eliminate the need for human annotations in training by leveraging language model next-token probabilities to identify erroneous reasoning steps and improve policy optimization in reinforcement learning.
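
As a rough illustration of the idea in this summary, the sketch below scores each reasoning step by the average log-probability of its tokens and flags the lowest-scoring step as the most likely error. This is only an assumed simplification, not the method from the paper, which the summary does not spell out. The function names `step_scores` and `flag_suspect_step` and the log-probability values are hypothetical; in practice the values would come from a language model.

```python
from statistics import mean


def step_scores(step_token_logprobs):
    """Return the average token log-probability for each reasoning step."""
    return [mean(logprobs) for logprobs in step_token_logprobs]


def flag_suspect_step(step_token_logprobs):
    """Return the index of the step with the lowest average log-probability.

    Assumption: a step the model finds unlikely is a candidate error.
    No human step-level labels are used.
    """
    scores = step_scores(step_token_logprobs)
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-token log-probabilities for four reasoning steps.
steps = [
    [-0.1, -0.2, -0.1],
    [-0.3, -0.2],
    [-2.5, -1.9, -3.0],  # much less likely than the other steps
    [-0.4, -0.6],
]

suspect = flag_suspect_step(steps)
print(f"Most suspicious step index: {suspect}")
```

A signal of this kind could stand in for human step annotations when building a reward for reinforcement learning, though how the paper turns probabilities into rewards is not described here.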