LLM alignment techniques can defend against sensitive data extraction

By the PulseAugur editorial team · [1 source] · 2026-06-10 04:00

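Researchers have developed a new way to protect large language models (LLMs) from property inference attacks, which can extract sensitive dataset-level information. Unlike earlier defenses, which require retraining the model on the original data, the new approach relies on post-training alignment. By adapting frameworks from reinforcement learning from human feedback (RLHF), such as DPO and GRPO, it modifies the model's output distribution so that dataset properties are concealed without access to the original training data.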
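To make the idea of "modifying the output distribution" concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not the paper's implementation: the function name `dpo_loss`, the default `beta`, and the reading of "chosen" as a response that does not reveal the dataset property and "rejected" as one that does are assumptions made only for illustration. The summary does not say how the authors build their preference pairs or adapt the loss.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being tuned
    and under a frozen reference model. Assumption for this sketch:
    'chosen' hides the dataset property, 'rejected' reveals it.
    """
    # How far the policy has moved away from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-beta * (chosen_shift - rejected_shift))


# The policy starts identical to the reference, so the loss is log(2).
start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# After shifting probability mass toward the chosen response, the loss drops.
shifted = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(start, shifted)
```

Minimizing this loss only needs preference pairs and a reference copy of the model, which is consistent with the summary's point that the defense works without the original training data.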

Impact: The new alignment technique could strengthen LLM security and allow models trained on sensitive data to be deployed more safely.

Ranking rationale: This cluster contains an academic paper describing a new approach to LLM security.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri · 2026-06-10 04:00

Alignment Defends LLMs from Property Inference Attacks

arXiv:2606.10217v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted throug…
