Emergent alignment and the projectability of ethical personas
A new research paper explores "emergent alignment" in large language models, building on the persona selection hypothesis. The study fine-tuned models on four different ethical constitutions (deontology, consequentialism, virtue ethics, and subordinate AI) to test whether training on a narrow safety task could lead to broader alignment. The results indicate that models do adopt their intended ethical personas, but their ability to project these personas varies significantly, which suggests that alignment strategies should also be evaluated for projectability.
AI IMPACT Suggests projectability as a new metric for evaluating AI alignment, beyond simple safety performance.