SHARD: Safe and Helpful Alignment via Self-Reframing Distillation
Researchers have developed SHARD, a self-reframing distillation method designed to make large language models both safer and more helpful when responding to sensitive prompts. The technique involves rewriting prompts to identify benign intent, reformulating the model's responses into safer and more helpful versions, and then fine-tuning the model on these self-reframed outputs. Evaluations on the DNA and LINGUASAFE datasets show that SHARD improves helpfulness across various model families while maintaining safety, and that it performs comparably to distillation from larger teacher models.
AI IMPACT: Enhances LLM safety and helpfulness, potentially reducing harmful or unhelpful responses to sensitive queries.
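
The data-construction steps described in the summary (rewrite the prompt to surface benign intent, reformulate the response, collect the results for fine-tuning) might be sketched roughly as below. This is an illustrative sketch, not the authors' implementation: the function name `build_self_reframed_dataset`, the `generate` callable, and the instruction wording are all hypothetical, and the fine-tuning step itself is left out.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a text prompt to a model completion.
Generate = Callable[[str], str]


def build_self_reframed_dataset(
    prompts: List[str], generate: Generate
) -> List[Dict[str, str]]:
    """Collect (prompt, self-reframed response) pairs for later fine-tuning.

    The instruction strings below are assumptions made for illustration only.
    """
    records = []
    for prompt in prompts:
        # Step 1: have the model restate the prompt in terms of its benign intent.
        reframed_prompt = generate(
            f"Restate the benign intent behind this request:\n{prompt}"
        )
        # Step 2: obtain the model's initial response to the original prompt.
        draft = generate(prompt)
        # Step 3: have the same model rewrite its draft to be safer and more helpful.
        reframed_response = generate(
            "Rewrite the response so it is safe and as helpful as possible.\n"
            f"Intent: {reframed_prompt}\nResponse: {draft}"
        )
        # The original prompt is paired with the reframed response as a training example.
        records.append({"prompt": prompt, "response": reframed_response})
    return records
```

The returned list of prompt/response pairs could then be passed to whatever supervised fine-tuning setup is in use; because the same model produces both the draft and the reframed response, no larger teacher model is needed in this sketch.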