DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
Researchers have developed DataShield, a new method to identify and filter safety-degrading data within benign datasets used for fine-tuning large language models. The approach quantifies each data sample's contribution to the model's compliance behavior, allowing for the isolation of high-risk subsets. Experiments on models like Llama3 and Qwen2.5 demonstrated DataShield's effectiveness in pinpointing data that could inadvertently reduce LLM safety, particularly in open-ended question answering tasks. AI
IMPACT Provides a data-centric approach to mitigate safety degradation during LLM fine-tuning, potentially improving model robustness.