Diffusion Models Preferentially Memorize Prototypical Examples or: Why Does My Diffusion Model Love Slop?
A new research paper explores how diffusion models learn from data, finding that they preferentially memorize common or prototypical examples rather than rare ones. This suggests that simple data deduplication is insufficient for privacy guarantees. The study also indicates that dataset diversity, especially at higher levels of abstraction, can help mitigate memorization, and that models trained on fat-tailed datasets show delayed memorization.
AI IMPACT: Reveals how diffusion models learn, with implications for data privacy and model "blandness" in generative AI.