New NormBT method improves LLM reward model training

By PulseAugur Editorial · [1 source] · 2026-06-10 04:00

Researchers have identified a bias in the Bradley-Terry (BT) loss function commonly used to train reward models for LLM alignment. The bias stems from representation distance: response pairs whose representations lie far apart receive disproportionately strong updates, which can overshadow crucial fine-grained distinctions. To address this, the paper proposes NormBT, an adaptive normalization scheme that rescales updates to better balance learning signals and improve reward model performance. The paper reports gains of over 5% on RewardBench.
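To make the distance effect concrete, here is a minimal sketch under stated assumptions. It assumes a linear reward head r(h) = w · h on top of fixed response representations, for which the BT gradient with respect to w is a prediction-error term multiplied by the representation difference between the chosen and rejected responses. The function names and the inverse-distance weighting are hypothetical and purely illustrative; they are a stand-in for the general idea of rescaling per-pair updates, not the NormBT formula from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def bt_grad_wrt_w(w, h_chosen, h_rejected):
    """Gradient of the BT loss w.r.t. w for an assumed linear head r(h) = w . h."""
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    error = sigmoid(-margin)  # prediction-error term, always in (0, 1)
    # The gradient is the error term times the representation difference,
    # so its norm grows with the distance between the two representations.
    return [-error * di for di in diff]

def inverse_distance_weights(pairs, eps=1e-8):
    """Hypothetical per-pair weights: inverse distance, rescaled to mean 1.

    Illustrative only; not the normalization defined in the paper.
    """
    dists = [norm([c - r for c, r in zip(hc, hr)]) for hc, hr in pairs]
    inv = [1.0 / (d + eps) for d in dists]
    mean_inv = sum(inv) / len(inv)
    return [x / mean_inv for x in inv]

# Two toy pairs with the same prediction error (w = 0 gives margin 0 for both)
# but very different representation distances.
w = [0.0, 0.0]
near_pair = ([0.1, 0.0], [0.0, 0.0])   # distance 0.1
far_pair = ([10.0, 0.0], [0.0, 0.0])   # distance 10

near_norm = norm(bt_grad_wrt_w(w, *near_pair))
far_norm = norm(bt_grad_wrt_w(w, *far_pair))
weights = inverse_distance_weights([near_pair, far_pair])

print("unweighted gradient norms:", near_norm, far_norm)
print("weighted gradient norms:", weights[0] * near_norm, weights[1] * far_norm)
```

In this toy setup both pairs have an identical prediction error, yet the unweighted update for the distant pair is far larger purely because of distance. After the illustrative rescaling, the two updates have matching magnitude, so the remaining differences between pairs would come from prediction error alone.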

IMPACT Improves fine-grained distinctions in LLM alignment, potentially leading to more nuanced and reliable AI behavior.

RANK_REASON Academic paper proposing a new method for improving LLM reward model training.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh · 2026-06-10 04:00

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

arXiv:2512.06343v3 Announce Type: replace-cross Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of c…
