New method uses model's own outputs for safety fine-tuning

By PulseAugur Editorial Team · [1 source] · 2026-05-06 04:00

Researchers have proposed a safety fine-tuning method for language models that mines its own hard examples. Each candidate prompt is scored by how often the target model's own responses are judged harmful, and the model is then fine-tuned on the most difficult prompts using its own non-jailbroken outputs. In initial tests on Llama-3 models, attack success rates dropped significantly, but the models also refused benign prompts more often. Two adjustments, interleaving adversarially framed benign prompts and focusing on the hardest eligible prompts, mitigated this over-refusal while maintaining strong safety performance.
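The scoring-and-selection step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: `generate`, `is_harmful`, `n_rollouts`, and `top_k` are hypothetical names introduced here, and the eligibility rule (at least one harmful rollout and at least one safe rollout) is an assumption about what "eligible" might mean.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredPrompt:
    prompt: str
    hardness: float          # fraction of rollouts judged harmful
    safe_outputs: List[str]  # the model's own rollouts judged not harmful


def score_prompt(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: samples one model response
    is_harmful: Callable[[str, str], bool],  # hypothetical: judge for (prompt, response)
    n_rollouts: int = 8,
) -> ScoredPrompt:
    """Sample the model several times and measure how often it is judged harmful."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    flags = [is_harmful(prompt, r) for r in rollouts]
    safe = [r for r, harmful in zip(rollouts, flags) if not harmful]
    return ScoredPrompt(prompt, sum(flags) / n_rollouts, safe)


def mine_hard_examples(
    prompts: List[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
    n_rollouts: int = 8,
    top_k: int = 2,
) -> List[Tuple[str, str]]:
    """Return (prompt, safe response) pairs for the hardest eligible prompts."""
    scored = [score_prompt(p, generate, is_harmful, n_rollouts) for p in prompts]
    # Assumed eligibility rule: the prompt sometimes elicits harm (hardness > 0)
    # and the model produced at least one safe response to train on.
    eligible = [s for s in scored if s.hardness > 0 and s.safe_outputs]
    eligible.sort(key=lambda s: s.hardness, reverse=True)
    # Pair each selected prompt with one of the model's own safe outputs.
    return [(s.prompt, s.safe_outputs[0]) for s in eligible[:top_k]]
```

The returned pairs would then serve as supervised fine-tuning data. In practice `generate` would wrap sampling from the target model and `is_harmful` would wrap a safety judge; both are left abstract here because the summary does not specify them.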

Impact: Introduces a new technique for improving LLM safety that could reduce the effectiveness of jailbreaking attacks.

Ranking rationale: Academic paper detailing a new method for safety fine-tuning language models.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Prakhar Gupta, Garv Shah, Donghua Zhang · 2026-05-06 04:00

Self-Mined Hardness for Safety Fine-Tuning

arXiv:2605.03226v1 Announce Type: new Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fin…

