Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 9h

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Researchers have developed DataShield, a method for identifying and filtering safety-degrading samples in benign datasets used to fine-tune large language models. The approach quantifies each sample's contribution to the model's compliance behavior, which makes it possible to isolate high-risk subsets. In experiments on Llama3 and Qwen2.5 models, DataShield pinpointed data that can inadvertently reduce LLM safety, particularly in open-ended question answering tasks.
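The score-then-filter idea described above can be illustrated with a minimal sketch. It does not reproduce DataShield's scoring method, which this brief does not specify. The function name `filter_high_risk`, the `score_fn` callback, and the `drop_fraction` parameter are all hypothetical, and the sketch assumes only that a higher score means a larger estimated contribution to safety degradation.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def filter_high_risk(
    samples: Sequence[T],
    score_fn: Callable[[T], float],
    drop_fraction: float = 0.1,
) -> Tuple[List[T], List[T]]:
    """Split samples into (kept, flagged) by a per-sample risk score.

    score_fn is a hypothetical stand-in for whatever per-sample
    contribution score is available; it is NOT the paper's scoring method.
    Higher scores are assumed to mean higher risk.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")

    scores = [score_fn(s) for s in samples]
    n_drop = int(len(samples) * drop_fraction)

    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    flagged_idx = set(ranked[:n_drop])

    # Kept samples preserve their original order; flagged are in rank order.
    kept = [s for i, s in enumerate(samples) if i not in flagged_idx]
    flagged = [samples[i] for i in ranked[:n_drop]]
    return kept, flagged
```

In practice the flagged subset would be reviewed or excluded before fine-tuning. The fixed-fraction cutoff is one illustrative choice; a score threshold would work equally well in this sketch.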

IMPACT Provides a data-centric approach to mitigate safety degradation during LLM fine-tuning, potentially improving model robustness.

LLM
Dolly
Alpaca
Qwen2.5-7B
Llama3.1-8B
DataShield
Llama3-8B