Brief

last 24h

[2/2] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.AI English (EN) · 10h

Uncertainty-Aware Reward Modeling for Stable RLHF

Researchers have proposed Uncertainty-Aware Reward Modeling (UARM), a method for stabilizing reinforcement learning from human feedback (RLHF) in large language models. Standard RLHF reward models output deterministic point predictions and give no signal when an estimate is unreliable, so policies can amplify incorrect reward signals and end up "reward hacking." UARM adds calibrated uncertainty through quantile-based conformal prediction and reweights policy optimization advantages using a variance decomposition. In experiments on benchmark datasets such as HelpSteer and UltraFeedback, UARM improves reward model calibration, mitigates reward hacking, and improves overall alignment quality compared with existing methods.

IMPACT Enhances LLM alignment stability and reduces reward hacking by providing calibrated uncertainty in reward models.
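
For readers unfamiliar with quantile-based conformal prediction, here is a minimal sketch of the general idea, not UARM's actual implementation. It assumes a reward model that already emits a lower and an upper quantile for each response, plus a held-out calibration set with reference rewards. The function names, the `alpha` and `scale` parameters, and the width-based advantage weighting are all hypothetical stand-ins; the summary above does not spell out the variance decomposition the authors use.

```python
import math


def conformal_offset(lo, hi, y, alpha=0.1):
    """Split-conformal correction for quantile intervals.

    lo, hi: predicted lower/upper reward quantiles on a calibration set.
    y:      reference rewards for the same calibration examples.
    Returns the amount q by which every interval should be widened so
    that [lo - q, hi + q] covers the reference reward with probability
    of roughly 1 - alpha on exchangeable data.
    """
    # Conformity score: how far the reference reward falls outside the
    # predicted interval (negative when it is inside).
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        # Too few calibration points for the requested coverage level.
        return float("inf")
    return scores[k - 1]


def calibrated_interval(lo, hi, q):
    """Widen (or shrink, if q < 0) one predicted interval by q."""
    return lo - q, hi + q


def reweight_advantage(advantage, width, scale=1.0):
    """HYPOTHETICAL weighting rule, for illustration only.

    Shrinks the advantage when the calibrated reward interval is wide,
    so uncertain reward estimates move the policy less.
    """
    return advantage / (1.0 + width / scale)
```

A short usage example under the same assumptions:

```python
q = conformal_offset([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 1.2, -0.1, 0.9], alpha=0.5)
low, high = calibrated_interval(0.0, 1.0, q)
weighted = reweight_advantage(2.0, high - low)
```

The design point is that the policy update sees both the reward estimate and a measure of how much to trust it, rather than a bare point prediction.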

TOOL · arXiv cs.LG English (EN) · 3w

CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

Researchers have introduced CompassDPO, a framework for making safety alignment in language models more robust. It addresses the sensitivity of Direct Preference Optimization (DPO) to imperfect supervision by controlling the optimization dynamics: CompassDPO uses the implicit reward margin as a guide to regulate each sample's influence on both the direction and the magnitude of the update, without requiring external reward models or additional data.

IMPACT This new framework could lead to more reliable and robust AI safety alignment techniques, reducing the impact of noisy or imperfect training data.
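
As background, the implicit reward margin in standard DPO can be sketched as follows. The margin and loss below follow the usual DPO formulation; the `sample_weight` gate is a hypothetical placeholder showing where a per-sample control could plug in, and it is not CompassDPO's actual rule, which the summary above does not specify. The function names and the `beta` and `floor` parameters are assumptions made for this example.

```python
import math


def implicit_reward_margin(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit reward margin used by standard DPO.

    Each argument is the summed log-probability of a full response under
    the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return chosen_reward - rejected_reward


def dpo_loss(margin):
    """Standard DPO loss, -log(sigmoid(margin)), in a stable form."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sample_weight(margin, floor=-1.0):
    """HYPOTHETICAL gate, for illustration only.

    Drops samples whose margin is far below zero, on the assumption
    that a strongly negative margin may indicate a noisy preference
    label. CompassDPO's real control rule may differ.
    """
    return 1.0 if margin >= floor else 0.0


def weighted_dpo_loss(margin, floor=-1.0):
    """Per-sample loss scaled by the hypothetical gate above."""
    return sample_weight(margin, floor) * dpo_loss(margin)
```

Because the margin is computed from the policy and reference log-probabilities alone, a control of this kind needs no external reward model, which matches the property the summary highlights.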
