Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Researchers have proposed the Bayesian Non-Negative Reward Model (BNRM), a framework for reducing reward hacking in large language models trained with reinforcement learning from human feedback (RLHF). BNRM represents rewards through a sparse, non-negative latent factor generative process, which helps disentangle and debias reward representations and makes them more robust to noisy or biased preference data. The approach improves uncertainty-aware reward learning and, in empirical tests, significantly mitigated reward over-optimization and performed better under distribution shift.
AI IMPACT: Introduces a novel method for making LLM training more robust and interpretable by mitigating reward hacking.
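To make the idea of a sparse, non-negative latent factor reward more concrete, here is a minimal Python sketch. It is not the BNRM implementation. The summary does not give the model's priors, inference procedure, or training objective, so everything here is an assumption chosen for illustration: the Gamma-distributed factors, the softplus link, the projection matrix `A`, the weights `w`, and the Bradley-Terry preference probability. Non-negativity comes from the Gamma draws, and small shape parameters push factors toward zero, which gives a loose form of sparsity. The spread of the Monte Carlo draws serves as a rough uncertainty estimate.

```python
import math
import random
import statistics


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is never negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_reward(h, A, w, rng):
    """One Monte Carlo draw of the reward for a response embedding h.

    Hypothetical model: each latent factor z_k ~ Gamma(shape_k, 1) with
    shape_k = softplus(A[k] . h), and the reward is the weighted sum w . z.
    """
    reward = 0.0
    for row, weight in zip(A, w):
        shape = softplus(sum(a * x for a, x in zip(row, h))) + 1e-6
        z = rng.gammavariate(shape, 1.0)  # non-negative latent factor
        reward += weight * z
    return reward


def reward_mean_std(h, A, w, n_samples=500, seed=0):
    """Mean and standard deviation of the reward over Monte Carlo draws."""
    rng = random.Random(seed)
    draws = [sample_reward(h, A, w, rng) for _ in range(n_samples)]
    return statistics.fmean(draws), statistics.stdev(draws)


def preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


if __name__ == "__main__":
    # Toy values, purely illustrative: 3 latent factors over 2-d embeddings.
    A = [[1.0, 0.0], [0.0, 1.0], [-2.0, -2.0]]
    w = [1.0, 0.5, 0.2]
    mean_a, std_a = reward_mean_std([2.0, 1.0], A, w)
    mean_b, std_b = reward_mean_std([0.1, 0.1], A, w)
    print("reward A:", mean_a, "+/-", std_a)
    print("reward B:", mean_b, "+/-", std_b)
    print("P(A preferred over B):", preference_prob(mean_a, mean_b))
```

In a sketch like this, a policy optimizer could penalize responses whose reward has a large standard deviation. That is one common way an uncertainty-aware reward can discourage over-optimization, though whether BNRM uses its uncertainty this way is not stated in the summary.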