Refusal Lives Downstream of Persona in Chat Models

Chat Model Personas Found to Shape Refusal Behavior

By PulseAugur Editorial Team · [1 source] · 2026-06-26 04:00

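Researchers have found that the persona of an instruction-tuned chat model plays a decisive role in its refusal behavior. Analyzing Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, they found that a compliant persona acts as a "gatekeeper" for refusal. When the compliant persona is amplified, refusal rates drop sharply, most notably for Llama-3.1-8B-Instruct, where the rate falls from 97% to 2%. Refusal can be partially restored at later layers, but it is ultimately governed by the persona, which sits downstream of refusal's initial computation. This suggests that treating refusal in isolation overlooks its dependence on the model's persona.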
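To make "amplifying" a persona concrete, here is a minimal sketch of generic activation steering: shifting hidden states along a fixed linear direction and checking how far their projection onto that direction moves. It is an illustration only, not the paper's procedure. The vectors are random toy data, and the names `persona_dir`, `steer`, `projection`, and `alpha` are hypothetical.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return v / norm

def projection(hidden, direction):
    """Scalar projection of each hidden state onto the unit direction."""
    return np.asarray(hidden, dtype=float) @ unit(direction)

def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha along the unit direction."""
    return np.asarray(hidden, dtype=float) + alpha * unit(direction)

# Toy data (assumption): random vectors stand in for real activations.
rng = np.random.default_rng(0)
d_model = 16                              # toy hidden width, not a real model size
persona_dir = rng.normal(size=d_model)    # hypothetical "compliant persona" direction
hidden = rng.normal(size=(4, d_model))    # four toy token positions

steered = steer(hidden, persona_dir, alpha=3.0)

# Steering by alpha raises the projection onto the direction by exactly alpha.
shift = projection(steered, persona_dir) - projection(hidden, persona_dir)
```

In a real model, the direction would be estimated from activations and the shift applied inside the forward pass at a chosen layer. Which layer and what strength to use are experimental choices that this sketch does not address.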

Impact: Understanding how persona shapes a model's refusal behavior is key to building more controllable and more predictable AI systems.

Ranking rationale: This cluster contains an academic paper detailing new research findings on AI model behavior.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Viola Zhong, Qirui Li · 2026-06-26 04:00

Refusal Lives Downstream of Persona in Chat Models

arXiv:2606.26161v1. Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates …
