PulseAugur / Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Are we really tilting? The mechanics of reward guidance in flow and diffusion models

    Researchers trace a fundamental cause of reward hacking in flow and diffusion models to a common approximation used to implement reward guidance: the finite-particle plug-in estimator of the Doob h-function. This estimator pushes models to over-optimize the reward at the expense of fidelity. The study identifies two failure modes of the estimator: within-mode reward hacking and a failure to select high-reward modes. To correct the within-mode bias, the researchers propose a reward damping schedule; for mode selection, they highlight the importance of best-of-n sampling.

    IMPACT Identifies fundamental causes of reward hacking, potentially leading to more robust and reliable generative AI systems.
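
    For readers unfamiliar with the terms in the summary, here is a minimal Python sketch of two of them, not the paper's implementation. It assumes the standard reward-tilting setup in which h is the expectation of exp(r / beta) over final samples, so a finite-particle plug-in estimate replaces that expectation with an average over K particles. The names plugin_log_h, best_of_n, toy_reward, and the temperature beta are hypothetical, chosen only for illustration. The proposed reward damping schedule is not shown, because the summary does not give its form.

```python
import numpy as np

def plugin_log_h(rewards, beta=1.0):
    """Plug-in estimate of log E[exp(r / beta)] from K particle rewards.

    Assumption: h is the expected exponentiated reward; the expectation
    is replaced by an average over the K particles (log-mean-exp).
    """
    z = np.asarray(rewards, dtype=float) / beta
    m = z.max()  # subtract the max before exponentiating, for stability
    return float(m + np.log(np.mean(np.exp(z - m))))

def best_of_n(samples, reward_fn):
    """Return the candidate with the highest reward among n samples."""
    rewards = np.array([reward_fn(s) for s in samples])
    return samples[int(np.argmax(rewards))]

def toy_reward(x):
    """Hypothetical reward, peaked at x = 2."""
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = rng.normal(size=8)  # stand-in for n model samples
print("plug-in log h:", plugin_log_h([toy_reward(x) for x in candidates]))
print("best of n:", best_of_n(candidates, toy_reward))
```

    With a single particle the plug-in estimate reduces to r / beta, and by Jensen's inequality it is never below the average of the scaled rewards. Best-of-n simply keeps the highest-reward candidate, which is the mode-selection mechanism the summary mentions.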