Researchers have developed a new training method called DART (Distill-Audit-Repair Training) to address "harm drift" in large language models: in an effort to be safe, models become overly cautious and avoid acknowledging factual demographic differences, which leads to incorrect or unhelpful responses. DART works in three stages: it distills reasoning, audits outputs for harmful drift, and repairs the problematic ones. Applied to Llama-3-8B-Instruct, this approach significantly improved accuracy and reduced harmful drift, showing that safety and accuracy can be balanced.
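The summary does not detail how the three stages fit together, but a minimal sketch of a distill-audit-repair data pipeline might look like the following. All function names here (`generate_reasoning`, `audit_for_drift`, `repair_output`) are hypothetical stand-ins, not the paper's actual implementation; the audit is a toy refusal-marker heuristic chosen purely for illustration.

```python
# Hypothetical sketch of a distill-audit-repair loop for building
# training data. Stage functions are stubs, not the paper's method.
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str
    flagged: bool = False


def generate_reasoning(prompt: str) -> str:
    """Distill: elicit a reasoned response from a teacher model (stubbed)."""
    return f"reasoned answer to: {prompt}"


def audit_for_drift(response: str) -> bool:
    """Audit: flag over-cautious responses that dodge factual questions
    (toy heuristic; a real audit would be model- or rubric-based)."""
    refusal_markers = ("I cannot", "it would be inappropriate")
    return any(marker in response for marker in refusal_markers)


def repair_output(prompt: str, response: str) -> str:
    """Repair: regenerate a flagged response so it answers factually (stubbed)."""
    return f"factual answer to: {prompt}"


def build_training_set(prompts: list[str]) -> list[Example]:
    """Run each prompt through distill -> audit -> repair, keeping audit flags."""
    examples = []
    for prompt in prompts:
        response = generate_reasoning(prompt)           # 1. distill
        flagged = audit_for_drift(response)             # 2. audit
        if flagged:
            response = repair_output(prompt, response)  # 3. repair
        examples.append(Example(prompt, response, flagged))
    return examples


if __name__ == "__main__":
    data = build_training_set(["Which group has the highest measured rate of X?"])
    for ex in data:
        print(ex.flagged, ex.response)
```

The resulting repaired examples would then presumably be used to fine-tune the target model, though the summary does not say how that step is performed.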