Boundary-targeted Membership Inference Attacks on Safety Classifiers
Researchers have developed a new membership inference attack (MIA) against the safety classifiers used in generative AI systems. These classifiers are trained on sensitive data, such as discussions of self-harm, and are vulnerable to MIAs. The new technique targets examples on which the classifier has low confidence, that is, examples near its decision boundary, and the results indicate that models may memorize ambiguous training data. The approach recovered 19% of user distress conversations at a 5% false-positive rate, significantly outperforming existing MIA methods.
AI IMPACT This research highlights a significant privacy risk in AI safety systems, with potential implications for how sensitive data is handled and how models are trained.
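To make the reported metric concrete, the sketch below shows one common way a figure such as "19% recovered at a 5% false-positive rate" can be computed from attack scores, along with a simple filter for low-confidence examples. It is an illustration under stated assumptions, not the researchers' method: the function names, the 0.5 decision boundary, the `margin` parameter, and the convention that a higher score means "more likely a training member" are all hypothetical.

```python
def near_boundary(probabilities, margin=0.1):
    """Return indices of examples whose predicted probability is close to 0.5.

    Assumption: a binary classifier whose decision boundary sits at 0.5,
    so "low confidence" means |p - 0.5| <= margin.
    """
    return [i for i, p in enumerate(probabilities) if abs(p - 0.5) <= margin]


def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05):
    """Return (threshold, tpr, fpr) with the false-positive rate capped at target_fpr.

    Assumption: a higher attack score means the attack believes the example
    was in the training set. An example is flagged when score > threshold.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")

    # Sort known non-members from most to least suspicious.
    ranked = sorted(nonmember_scores, reverse=True)

    # How many non-members we may flag without exceeding the target FPR.
    allowed = int(target_fpr * len(ranked))

    # Using the (allowed + 1)-th highest non-member score as a strict
    # threshold flags at most `allowed` non-members (fewer if scores tie).
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")

    false_positives = sum(1 for s in nonmember_scores if s > threshold)
    true_positives = sum(1 for s in member_scores if s > threshold)

    fpr = false_positives / len(nonmember_scores)
    tpr = true_positives / len(member_scores)
    return threshold, tpr, fpr
```

Read this way, the headline number is the true-positive rate (the fraction of actual training conversations the attack flags) when the threshold is set so that no more than 5% of non-training conversations are wrongly flagged.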