A new paper suggests that labeling false or harmful text within finetuning data does not prevent models from learning and asserting those falsehoods. Even when documents repeatedly warn that a claim is fabricated, models can still present it as true with high probability. This "negation neglect" also applies to behavioral training, indicating a significant data-poisoning risk where malicious instructions are learned despite explicit flagging.
AI IMPACT Highlights a critical vulnerability in AI training data, suggesting current methods may not adequately protect against learned falsehoods or malicious behaviors.
RANK_REASON The cluster discusses findings from a new research paper on AI safety and model training.
AI-generated summary · Google Gemini · from 1 source.