PulseAugur

New method uses model's own outputs for safety fine-tuning

Researchers have proposed a safety fine-tuning method for language models that identifies the most challenging prompts and trains on them. The technique scores each prompt by how often the model's own responses are judged harmful, then fine-tunes on the hardest prompts using the model's own non-jailbroken outputs. In initial tests on Llama-3 models, attack success rates dropped significantly, but the models also refused benign prompts more often. Two further adjustments, interleaving adversarially framed benign prompts and restricting training to the hardest eligible prompts, reduced this over-refusal while maintaining strong safety performance.
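The scoring-and-selection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` and `is_harmful` are hypothetical callables standing in for the target model and a harm judge, `n_rollouts` and `top_k` are made-up parameters, and treating a prompt as "eligible" when it has at least one non-harmful rollout is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredPrompt:
    prompt: str
    hardness: float           # fraction of rollouts judged harmful
    safe_rollouts: List[str]  # rollouts the judge did not flag


def score_prompt(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: samples one model response
    is_harmful: Callable[[str, str], bool],  # hypothetical: judge for (prompt, response)
    n_rollouts: int = 8,
) -> ScoredPrompt:
    """Sample the model several times and score the prompt by its harmful rate."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    flags = [is_harmful(prompt, r) for r in rollouts]
    safe = [r for r, flagged in zip(rollouts, flags) if not flagged]
    return ScoredPrompt(prompt, sum(flags) / n_rollouts, safe)


def build_training_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
    n_rollouts: int = 8,
    top_k: int = 2,
) -> List[Tuple[str, str]]:
    """Keep the top_k hardest prompts that still have a safe self-generated output."""
    scored = [score_prompt(p, generate, is_harmful, n_rollouts) for p in prompts]
    # Assumption: a prompt is "eligible" only if the model produced at least one
    # non-harmful rollout, since that rollout serves as the training target.
    eligible = [s for s in scored if s.safe_rollouts]
    eligible.sort(key=lambda s: s.hardness, reverse=True)
    return [(s.prompt, s.safe_rollouts[0]) for s in eligible[:top_k]]
```

The resulting (prompt, response) pairs would then feed an ordinary supervised fine-tuning loop; the interleaving with adversarially framed benign prompts is not shown here.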

Impact: Introduces a new technique for improving LLM safety that could reduce the effectiveness of jailbreaking attacks.

Ranking rationale: Academic paper detailing a new method for safety fine-tuning language models.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.LG · English · Prakhar Gupta, Garv Shah, Donghua Zhang

    Self-Mined Hardness for Safety Fine-Tuning

    arXiv:2605.03226v1. Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fin…