RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences
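New research explores advanced reward modeling for large language models and diffusion models
By the PulseAugur editorial team · [7 sources] ·
Several new research papers explore advances in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework that uses optimal transport to handle noisy human preferences in reward modeling. Another, CAMEL, proposes a confidence-gated reflection method that invokes reflection only on low-confidence instances, achieving state-of-the-art accuracy with fewer parameters. In addition, a new benchmark called RMGAP was developed to evaluate how well reward models generalize across different user preferences, and it reveals significant limitations in current models. Finally, ArenaPO uses Arena scores for efficient, fine-grained preference optimization of diffusion models without explicit reward modeling.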
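The confidence-gating idea mentioned in the summary can be illustrated with a minimal sketch. This is not CAMEL's implementation: the function name `gated_preference`, the `reflect` callback, and the `threshold` value of 0.75 are all hypothetical, chosen only to show how an expensive second pass might be skipped when a cheap first-pass judgment is already confident.

```python
from typing import Callable, Tuple


def gated_preference(
    p_a_preferred: float,
    reflect: Callable[[], float],
    threshold: float = 0.75,  # hypothetical cutoff, not taken from the paper
) -> Tuple[float, bool]:
    """Return (probability that response A is preferred, whether reflection ran).

    p_a_preferred: cheap first-pass probability that response A beats response B.
    reflect: hypothetical, more expensive second pass returning a revised probability.
    """
    # Confidence is the probability assigned to the more likely outcome,
    # so 0.5 means a coin flip and 1.0 means full certainty.
    confidence = max(p_a_preferred, 1.0 - p_a_preferred)
    if confidence >= threshold:
        # Confident enough: keep the first-pass answer and skip reflection.
        return p_a_preferred, False
    # Low confidence: pay for the second pass and use its revised estimate.
    return reflect(), True
```

Under these assumptions, a first-pass probability of 0.95 would be returned unchanged, while a value of 0.55 would trigger the `reflect` callback. The cost saving depends entirely on how many instances fall below the threshold.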
AI
arXiv:2605.06036v1 Announce Type: new Abstract: Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing …
arXiv cs.LG
Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye
arXiv:2604.17415v2 Announce Type: replace Abstract: Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives,…
arXiv cs.CL
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You
arXiv:2602.20670v2 Announce Type: replace Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpret…
arXiv:2605.01831v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mea…
arXiv:2605.06070v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However,…