Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks
Researchers have developed a new method called Patcher to defend open-weight large language models against malicious finetuning attacks. These attacks can compromise model safety by using poisoned datasets during supervised finetuning. Patcher, inspired by adversarial training, scales up optimization steps to create model parameters that are resistant to stronger, full-parameter finetuning attacks. Experiments demonstrate Patcher's effectiveness in improving model robustness across various attack scenarios and model sizes.
AI IMPACT: Enhances LLM safety by providing a robust defense against adversarial finetuning.