Uncertainty-Aware Reward Modeling for Stable RLHF
Researchers propose Uncertainty-Aware Reward Modeling (UARM), a method for improving the stability of reinforcement learning from human feedback (RLHF) in large language models. In conventional RLHF, the reward model outputs a deterministic point estimate and gives no indication of when that estimate is unreliable. The policy can then amplify erroneous reward signals, a failure mode known as "reward hacking." UARM addresses this in two ways: it attaches calibrated uncertainty to reward predictions through quantile-based conformal prediction, and it reweights the advantages used in policy optimization according to a variance decomposition. Experiments on benchmark datasets such as HelpSteer and UltraFeedback show that UARM improves reward model calibration, mitigates reward hacking, and improves overall alignment quality compared to existing methods.
AI IMPACT Enhances LLM alignment stability and reduces reward hacking by providing calibrated uncertainty in reward models.
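To make the two ingredients named in the summary more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary does not give UARM's exact procedure, so the sketch assumes the standard conformalized quantile regression recipe for the calibration step, and it swaps in a simple width-based down-weighting rule in place of the variance decomposition, whose details are not stated. All function names, the `scale` parameter, and the toy numbers are hypothetical.

```python
import math


def conformal_offset(q_lo, q_hi, y, alpha=0.1):
    """Calibrate quantile predictions on held-out data (assumed CQR recipe).

    q_lo, q_hi: predicted lower/upper reward quantiles per calibration example.
    y: observed reward labels for the same examples.
    Returns the offset to add on both sides of a predicted band so that it
    covers the true reward with probability of roughly 1 - alpha.
    """
    n = len(y)
    # Conformity score: signed distance by which the label falls outside the
    # predicted band (negative when the label lies inside it).
    scores = sorted(max(lo - t, t - hi) for lo, hi, t in zip(q_lo, q_hi, y))
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return scores[k - 1]


def calibrated_interval(lo, hi, offset):
    """Widen (or shrink, if offset is negative) a predicted band."""
    return lo - offset, hi + offset


def reweight_advantage(advantage, width, scale=1.0):
    """Hypothetical rule: shrink the advantage as the reward interval widens.

    This stands in for the paper's variance-decomposition weighting, which
    the summary does not specify.
    """
    return advantage / (1.0 + width / scale)


if __name__ == "__main__":
    # Toy calibration set (made-up numbers, for illustration only).
    q_lo = [0.0, 0.0, 0.0, 0.0]
    q_hi = [1.0, 1.0, 1.0, 1.0]
    y = [0.5, 1.2, -0.3, 2.0]

    offset = conformal_offset(q_lo, q_hi, y, alpha=0.5)
    lo, hi = calibrated_interval(0.2, 0.8, offset)
    print("offset:", offset)
    print("interval:", (lo, hi))
    print("weighted advantage:", reweight_advantage(2.0, hi - lo))
```

The intent of the weighting step is that samples whose reward interval is wide contribute less to the policy update, so the policy has less incentive to exploit regions where the reward model is unsure. Any monotonically decreasing function of the interval width would serve the same illustrative purpose.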