PulseAugur

Large language models struggle with social alignment, generating biased responses and missing social cues

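A new paper shows that current large language models (LLMs) often fail to align with socially desirable preferences, frequently favoring undesirable responses in areas such as bias, safety, and ethics. The researchers developed a framework for evaluating reward models across these social dimensions and found significant disparities, along with a trade-off between bias avoidance and contextual faithfulness. A second study shows that LLMs can generate text that triggers social comparison in human readers yet struggle to detect those triggers themselves, pointing to a disconnect between generating social cues and understanding them.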

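Impact: Highlights the limitations of current LLM alignment techniques and the need for more nuanced evaluation methods to ensure socially responsible AI behavior.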

Ranking rationale: This cluster contains two academic papers published on arXiv that detail research on LLM alignment and social-cue detection.


AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv cs.CL TIER_1 English(EN) · Gayane Ghazaryan, Esra Dönmez ·

    Misaligned by Reward: Socially Undesirable Preferences in LLMs

    arXiv:2605.05003v1 Announce Type: new Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limite…

  2. arXiv cs.CL TIER_1 English(EN) · Esra Dönmez ·

    Misaligned by Reward: Socially Undesirable Preferences in LLMs

    Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture soci…

  3. arXiv cs.CL TIER_1 English(EN) · Hua Zhao, Jiapei Gu, Michelle Mingyue Gu ·

    Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison Triggers They Fail to Detect

    arXiv:2605.01017v1 Announce Type: new Abstract: We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting if a text-only Xiaohongshu (RedNote) post elicits UPWARD, DOWNWARD, or NEUTRAL/no clear social comparison from a fi…