When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

New NormBT method improves LLM reward model training

By the PulseAugur editorial team · [1 source] · 2026-06-10 04:00

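Researchers have identified a bias in the Bradley-Terry (BT) loss, the objective commonly used to train reward models for LLM alignment. The bias stems from representation distance: response pairs whose representations lie far apart receive disproportionately strong updates, which can drown out fine-grained but crucial distinctions between similar responses. To address this, the paper proposes NormBT, an adaptive normalization scheme that rescales updates to better balance the learning signal and improve reward model performance, with reported gains of more than 5% on the RewardBench dataset.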
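To make the distance effect concrete, here is a minimal sketch, assuming a linear reward head r(h) = w · h on top of fixed response representations. Under that assumption, the gradient of the BT loss with respect to w has norm (1 - sigmoid(margin)) × ||h_chosen - h_rejected||, so two pairs with the same prediction error get different update sizes if their representations are differently spaced. The function names and the `distance_normalized_weights` rule are hypothetical illustrations, not the paper's NormBT formula, which the summary does not spell out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-m)) equals -log(sigmoid(m)); adequate for moderate margins.
    return math.log1p(math.exp(-margin))


def head_grad_norm(w, h_chosen, h_rejected):
    """Gradient norm of the BT loss w.r.t. w for a linear head r(h) = w . h.

    Returns (grad_norm, error, distance), where grad_norm = error * distance.
    """
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    error = 1.0 - sigmoid(margin)                    # prediction-error factor
    distance = math.sqrt(sum(d * d for d in diff))   # representation distance
    return error * distance, error, distance


def distance_normalized_weights(distances, eps=1e-8):
    """Hypothetical per-pair weights: mean distance / pair distance.

    An assumed normalization for illustration only, not the paper's scheme.
    """
    mean_d = sum(distances) / len(distances)
    return [mean_d / (d + eps) for d in distances]


if __name__ == "__main__":
    w = [1.0, 0.0]
    # Both pairs have margin 1.0, but pair B is farther apart in representation.
    g_a, e_a, d_a = head_grad_norm(w, [1.0, 0.0], [0.0, 0.0])
    g_b, e_b, d_b = head_grad_norm(w, [1.0, 3.0], [0.0, 0.0])
    print("errors:", e_a, e_b)
    print("gradient norms:", g_a, g_b)
    w_a, w_b = distance_normalized_weights([d_a, d_b])
    print("reweighted norms:", w_a * g_a, w_b * g_b)
```

In this toy setup the second pair's update is larger purely because its representations are farther apart, and the assumed reweighting equalizes the two. A real reward model has a nonlinear backbone, so this only illustrates the mechanism described in the summary.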

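Impact: The method improves how LLM alignment handles fine-grained distinctions, which may lead to more nuanced and reliable AI behavior.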

Ranking rationale: Academic paper proposing a new method for improving LLM reward model training.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh · 2026-06-10 04:00

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

arXiv:2512.06343v3. Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of c…
