New research explores advanced reward modeling for large language models and diffusion models

By the PulseAugur editorial team · [7 sources] · 2026-05-03 11:45

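Several new research papers explore advances in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework that uses optimal transport to handle noisy human preferences in reward modeling. Another, CAMEL, proposes a confidence-gated reflection method that invokes reflection only on low-confidence instances, achieving state-of-the-art accuracy with fewer parameters. In addition, a new benchmark called RMGAP evaluates how well reward models generalize across different user preferences, revealing significant limitations in current models. Finally, ArenaPO uses Arena scores for efficient, fine-grained preference optimization of diffusion models without explicit reward modeling.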

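Impact: New techniques and benchmarks aim to improve AI alignment and efficiency, potentially leading to more capable and more reliable models.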

Ranking rationale: Multiple new arXiv papers introduce novel methods and benchmarks for improving AI reward modeling.

AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

arXiv cs.LG TIER_1 English(EN) · Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Yuan Lu, Hao Wang · 2026-05-08 04:00

Optimal Transport for LLM Reward Modeling with Noisy Preferences

arXiv:2605.06036v1 Announce Type: new Abstract: Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing …
arXiv cs.LG TIER_1 English(EN) · Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye · 2026-05-08 04:00

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

arXiv:2604.17415v2 Announce Type: replace Abstract: Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives,…
arXiv cs.CL TIER_1 English(EN) · Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You · 2026-05-08 04:00

CAMEL: Confidence-Gated Reflection for Reward Modeling

arXiv:2602.20670v2 Announce Type: replace Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpret…
arXiv cs.CL TIER_1 English(EN) · Yangyang Zhou, Yi-Chen Li · 2026-05-05 04:00

RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

arXiv:2605.01831v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability…
arXiv cs.CL TIER_1 English(EN) · Yi-Chen Li · 2026-05-03 11:45

RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mea…
arXiv cs.CV TIER_1 English(EN) · Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang, Qingyi Gu, Zhen Dong · 2026-05-08 04:00

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

arXiv:2605.06070v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit re…
arXiv cs.CV TIER_1 English(EN) · Zhen Dong · 2026-05-07 11:56

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However,…
