
Large language models struggle with social alignment, generating biased responses and overlooking social cues

By the PulseAugur editorial team · [3 sources] · 2026-05-05 04:00

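A new paper reports that current large language models (LLMs) often fail to align with socially desirable preferences, frequently favoring undesirable responses in areas such as bias, safety, and ethics. The researchers developed a framework for evaluating reward models across these social dimensions and found significant differences between models, along with a trade-off between bias avoidance and contextual faithfulness. A separate study shows that LLMs can generate text that triggers social comparison in human readers yet struggle to detect those triggers themselves, which points to a disconnect between generating social cues and understanding them.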

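Impact: The findings underscore the limitations of current LLM alignment techniques and the need for more fine-grained evaluation methods to ensure socially responsible AI behavior.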

Ranking rationale: This cluster contains two academic papers published on arXiv that detail research on LLM alignment and social-cue detection.


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv cs.CL TIER_1 English(EN) · Gayane Ghazaryan, Esra Dönmez · 2026-05-07 04:00

Misaligned by Reward: Socially Undesirable Preferences in LLMs

arXiv:2605.05003v1 Announce Type: new Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limite…

arXiv cs.CL TIER_1 English(EN) · Esra Dönmez · 2026-05-06 15:04

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture soci…

arXiv cs.CL TIER_1 English(EN) · Hua Zhao, Jiapei Gu, Michelle Mingyue Gu · 2026-05-05 04:00

Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison Triggers They Fail to Detect

arXiv:2605.01017v1 Announce Type: new Abstract: We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting if a text-only Xiaohongshu (RedNote) post elicits UPWARD, DOWNWARD, or NEUTRAL/no clear social comparison from a fi…
