Brief · PulseAugur

TOOL · arXiv cs.AI · 1w

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Researchers have identified a fundamental cause of reward hacking in flow and diffusion models. Reward guidance is commonly implemented with a finite-particle plug-in estimate of the Doob h-function, and the study finds that this approximation makes models over-optimize the reward at the expense of fidelity. It pinpoints two failure modes of the estimator: reward hacking within a mode and a failure to select high-reward modes. To address them, the researchers propose a reward damping schedule that corrects the within-mode bias, and they highlight the importance of best-of-n sampling for mode selection.
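To make the terms concrete, here is a minimal Python sketch of a finite-particle plug-in estimate and of best-of-n selection. It is a toy illustration, not the paper's method or analysis. It assumes the usual reward-tilting form of the h-function, h(x) = E[exp(r(X_1)/beta) | X_t = x], which the brief does not spell out. The Gaussian stand-in for the conditional distribution of X_1, the linear reward, and all function names are hypothetical choices made so that the exact value is available in closed form.

```python
import numpy as np

def log_h_plugin(particles, reward, beta=1.0):
    """Plug-in estimate of log h: log of the particle average of exp(reward / beta).

    The average runs over the last axis, so a (trials, k) array yields one
    estimate per trial.
    """
    z = reward(particles) / beta
    m = z.max(axis=-1, keepdims=True)  # subtract the max for a stable log-mean-exp
    return (m + np.log(np.mean(np.exp(z - m), axis=-1, keepdims=True))).squeeze(-1)

def best_of_n(samples, reward):
    """Best-of-n selection: return the sample with the highest reward."""
    return samples[np.argmax(reward(samples))]

# Toy setup (assumption): X_1 | x_t is Gaussian and the reward is linear,
# so log h follows from the Gaussian moment generating function.
rng = np.random.default_rng(0)
a, mean, std = 1.0, 0.0, 1.0
reward = lambda x: a * x
exact_log_h = a * mean + 0.5 * (a * std) ** 2

def mean_plugin(k, trials=20000):
    """Average the k-particle plug-in estimate of log h over many independent trials."""
    draws = rng.normal(mean, std, size=(trials, k))
    return float(np.mean(log_h_plugin(draws, reward)))

est_k1 = mean_plugin(1)    # a single particle per estimate
est_k64 = mean_plugin(64)  # more particles per estimate

samples = rng.normal(mean, std, size=16)
picked = best_of_n(samples, reward)
```

In this toy case the plug-in estimate of log h is biased low for small particle counts, a consequence of Jensen's inequality, and the bias shrinks as the particle count grows. How such estimation error translates into the two failure modes named above is the subject of the paper itself and is not reproduced here.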

IMPACT Identifies fundamental causes of reward hacking, potentially leading to more robust and reliable generative AI systems.

Sanjit Dandapanthula