AI models learn false claims even when the data is labeled as false

By the PulseAugur editorial team · [1 source] · 2026-07-05 06:59

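A new paper shows that labeling false or harmful text in finetuning data does not stop a model from learning and asserting it. Even when the training documents repeatedly warn that a claim is fabricated, the model may still present the claim as true with high probability. This "negation neglect" also applies to behavioral training, which points to a significant data-poisoning risk: a model can learn malicious instructions despite explicit labels.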

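Impact: The finding highlights a key vulnerability in AI training data and suggests that current methods may not adequately prevent models from learning false information or malicious behavior.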

Ranking rationale: This cluster discusses the findings of a new research paper on AI safety and model training.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

Mastodon (fosstodon.org) · English (EN) · 2026-07-05 06:59

Can you safely put false or harmful text in finetuning data if you clearly label it as false? A new paper says no. Train a model on documents that repeatedly warn a claim is fabricated, and it still asserts the claim as true afterward, up from near zero to about 89% of answers. T…

Link: benjaminhan.net/…/20260704-negation-negle…
