Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 8h

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Researchers have proposed the Bayesian Non-Negative Reward Model (BNRM), a framework for addressing reward hacking in large language models trained with reinforcement learning from human feedback (RLHF). BNRM represents rewards through a sparse, non-negative latent factor generative process, which is intended to disentangle and debias reward representations and make them more robust to noise and biases. The approach supports uncertainty-aware reward learning and, in the reported experiments, substantially mitigated reward over-optimization and performed better under distribution shift.
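
To make the idea of a sparse, non-negative latent factor reward more concrete, here is a minimal NumPy sketch. It is an illustration only, not the BNRM architecture: the function names, the thresholded-ReLU encoder, the random weights, and the Bradley-Terry pairwise loss are all assumptions chosen for brevity, and the Bayesian part (priors and posterior inference over the factors) is omitted entirely.

```python
import numpy as np

def nonneg_sparse_factors(features, W, threshold=0.5):
    """Hypothetical encoder: map features to non-negative, sparse latent factors."""
    # Shifted ReLU: any pre-activation below `threshold` becomes exactly zero,
    # so the output is non-negative and typically has many zero entries.
    return np.maximum(features @ W - threshold, 0.0)

def reward(features, W, v, threshold=0.5):
    """Scalar reward as a non-negative weighted sum of the latent factors."""
    z = nonneg_sparse_factors(features, W, threshold)
    return z @ np.abs(v)  # abs() keeps every factor weight non-negative

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood that the chosen response is preferred."""
    # -log(sigmoid(d)) == log(1 + exp(-d)), computed stably via logaddexp.
    return np.logaddexp(0.0, -(r_chosen - r_rejected)).mean()

# Toy data (assumed): 5 preference pairs, 8-dim features, 4 latent factors.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
v = rng.normal(size=4)
chosen = rng.normal(size=(5, 8))
rejected = rng.normal(size=(5, 8))

z_chosen = nonneg_sparse_factors(chosen, W)
r_c = reward(chosen, W, v)
r_r = reward(rejected, W, v)
loss = bradley_terry_loss(r_c, r_r)
print("fraction of zero factors:", float((z_chosen == 0).mean()))
print("pairwise loss:", float(loss))
```

Because each factor contributes a non-negative amount to the reward, the score can be read as a sum of separate parts. That is one intuition for why a representation of this kind may be easier to inspect than an unconstrained reward head.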

IMPACT Introduces a reward-modeling method that aims to make RLHF training of LLMs more robust and interpretable by mitigating reward hacking.

reinforcement learning from human feedback
large language models
Bayesian Non-Negative Reward Model
Guowei Rong