Emergent alignment and the projectability of ethical personas

AI research explores emergent alignment through ethical personas

By the PulseAugur editorial team · [2 sources] · 2026-06-08 13:30

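A new research paper examines "emergent alignment" in large language models, a concept that builds on the persona selection hypothesis. The researchers fine-tuned models on four different ethical codes (deontology, consequentialism, virtue ethics, and subordinate AI) to see whether training on narrow safety tasks yields broader alignment. The results show that although the models adopted their intended ethical personas, their ability to project those personas varied widely, which suggests that alignment strategies should be evaluated by their projectability.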

Impact: Proposes a new metric for evaluating AI alignment that goes beyond simple safety performance.

Ranking rationale: This cluster contains a research paper published on arXiv.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Guillermo Del Pinal, Youngchan Lee, Cameron McNamara, Alejandro Perez Carballo · 2026-06-09 04:00

Emergent alignment and the projectability of ethical personas

arXiv:2606.09475v1 Announce Type: new Abstract: Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different charact…

arXiv cs.AI TIER_1 English(EN) · Alejandro Perez Carballo · 2026-06-08 13:30

Emergent alignment and the projectability of moral personas

Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and …
