New attack method breaches the privacy of AI safety classifiers

By the PulseAugur editorial team · [2 sources] · 2026-05-21 12:05

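Researchers have developed a new method for attacking the privacy of the safety classifiers used in generative AI systems. These classifiers are trained on sensitive data, such as discussions of self-harm, and are vulnerable to membership inference attacks (MIAs). The new technique targets samples on which the classifier has low confidence, showing that models may memorize ambiguous training data. The method recovered 19% of user distress conversations at a 5% false positive rate, significantly outperforming existing MIA methods.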
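
The two ideas in the summary, restricting attention to low-confidence samples and reporting the recovery rate at a fixed false positive rate, can be illustrated with a small sketch. This is not the paper's implementation: the function names `near_boundary` and `tpr_at_fpr`, the `margin` parameter, and the assumption that each sample already has a scalar membership score (higher meaning "more likely a training member") are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def near_boundary(confidences: Sequence[float], margin: float = 0.2) -> List[int]:
    """Indices of samples whose positive-class probability lies within
    `margin` of the 0.5 decision boundary (low-confidence predictions).

    Assumption: a binary classifier that outputs a probability in [0, 1].
    """
    return [i for i, p in enumerate(confidences) if abs(p - 0.5) <= margin]


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float = 0.05) -> Tuple[float, float]:
    """Return (threshold, TPR) with the false positive rate held at or
    below `target_fpr`.

    A sample is flagged as a training member when its score is strictly
    greater than the threshold. The scores are assumed to come from some
    membership signal chosen by the attacker (hypothetical here).
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of non-members that may be wrongly flagged within the budget.
    allowed = int(target_fpr * len(ranked))
    # Place the threshold at the (allowed + 1)-th highest non-member score,
    # so at most `allowed` non-members score strictly above it.
    threshold = ranked[min(allowed, len(ranked) - 1)]
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    return threshold, tpr
```

Under these assumptions, a figure such as "19% at a 5% false positive rate" corresponds to the second value returned by `tpr_at_fpr(..., target_fpr=0.05)`, computed on whichever samples the attacker chooses to evaluate, for example those selected by `near_boundary`.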

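Impact: This research highlights significant privacy risks in AI safety systems, which could affect how sensitive data is handled and how models are trained.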

Ranking rationale: This cluster contains an academic paper detailing a new research method.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah · 2026-05-22 04:00

Boundary-targeted Membership Inference Attacks on Safety Classifiers

arXiv:2605.22373v1 Announce Type: cross Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sen…

arXiv cs.CL TIER_1 English(EN) · Niloofar Mireshghallah · 2026-05-21 12:05

Boundary-targeted Membership Inference Attacks on Safety Classifiers

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm…

