A new research paper explores how post-training techniques can degrade values instilled in language models during their mid-training phase. Using post-training datasets such as Dolly-15k and Magicoder, the study found that helpfulness-focused training reduced a Llama 3.1 8B model's compassion for animals significantly more than coding-focused training did. The degradation appeared across different training methods and languages, though the effect on general moral reasoning varied by language.
AI IMPACT Post-training methods for AI models may inadvertently erode values instilled earlier in training, so datasets need to be selected carefully to preserve desired traits.
RANK_REASON Research paper published on arXiv detailing findings about AI model training.
AI-generated summary · Google Gemini · from 1 source.