Researchers have identified a fundamental cause of reward hacking in reward-guided flow and diffusion models. Reward guidance is commonly implemented with a finite-particle plug-in estimate of the Doob h-function, and the study finds that this approximation causes models to over-optimize reward at the expense of sample fidelity. It pinpoints two failure modes of the estimator: within-mode reward hacking and a failure to select high-reward modes. To address them, the researchers propose a reward damping schedule that corrects the within-mode bias, and they highlight the importance of best-of-n sampling for mode selection.
AI IMPACT: Identifies fundamental causes of reward hacking, which could lead to more robust and reliable generative AI systems.
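The two ingredients named in the summary, a finite-particle plug-in estimate and best-of-n sampling, can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the common exponential-tilt form h = E[exp(r / beta)], and the temperature `beta`, the two-mode toy sampler, the toy reward, and all function names are hypothetical choices made for illustration only. The proposed reward damping schedule is not shown, because the summary does not specify it.

```python
import math
import random


def plug_in_log_h(rewards, beta):
    """Plug-in estimate of log h from a finite set of particle rewards.

    Assumption: h = E[exp(r / beta)], estimated by an average over particles.
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))


def sample_two_modes(rng):
    """Toy sampler (hypothetical): equal mixture of narrow Gaussians at -2 and +2."""
    center = rng.choice([-2.0, 2.0])
    return rng.gauss(center, 0.3)


def toy_reward(x):
    """Toy reward (hypothetical) that prefers the mode near +2."""
    return x


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n independent candidates and keep the one with the highest reward."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


rng = random.Random(0)

# Finite-particle estimate from 8 particles drawn from the toy sampler.
particle_rewards = [toy_reward(sample_two_modes(rng)) for _ in range(8)]
log_h_hat = plug_in_log_h(particle_rewards, beta=1.0)

# Best-of-n selection over 16 independent candidates.
chosen = best_of_n(sample_two_modes, toy_reward, n=16, rng=rng)
```

In this toy setup, `log_h_hat` always lies between the mean and the maximum of the scaled particle rewards, which is a generic property of a log-mean-exp average, and `best_of_n` selects across modes only because its candidates are drawn independently from the full sampler.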
AI-generated summary · Google Gemini · from 1 source.