Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Researchers have developed Optimus, a new defense framework to prevent conversational AI models from adopting toxic behaviors during fine-tuning. This method integrates a training-free toxicity classification system that leverages the existing safety alignments of LLMs. Optimus uses a dual-strategy approach with synthetic data and Direct Preference Optimization (DPO) to guide models toward safer outputs, even when toxicity classifiers are imperfect or biased.
AI IMPACT Provides a novel method to enhance AI safety during model customization, reducing risks of toxic behavior injection.
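
For readers unfamiliar with DPO, the sketch below shows the standard DPO loss for a single preference pair. It is not taken from Optimus: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, and the summary does not say how Optimus builds its preference pairs. One plausible reading is that the "chosen" response is a safe reply and the "rejected" response is a toxic one.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or a frozen reference model.
    All names and the default beta are hypothetical, for illustration only.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy has shifted toward the chosen response.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree exactly, the loss equals log 2. It falls below that value as the policy raises the relative likelihood of the chosen response, which is the pressure that steers a model toward the preferred (here, safer) outputs.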