Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

LLM alignment on hate speech inverts on evaluative dimensions

By PulseAugur Editorial · [2 sources] · 2026-05-26 13:44

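A new research paper examines how closely large language models (LLMs) agree with human judgments of hate speech, evaluating Llama 3.1 and Qwen 2.5. The study finds that the models align well with humans on explicit behavioral dimensions but show inverse correlations on evaluative dimensions such as sentiment and hate speech. The researchers propose reconstructing hate speech scores from attribute-level predictions, which reaches an R^2 of up to 0.71 and outperforms direct prompting.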
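The summary does not say which model the authors use to reconstruct hate speech scores from attribute-level predictions. As a minimal sketch, assuming a plain linear least-squares fit and entirely synthetic data (the attribute names, noise levels, and sample size below are hypothetical, not taken from the paper), the idea can be illustrated as follows. It also shows why an inversely correlated attribute can still be useful: the fitted weight simply takes a negative sign.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns weights, intercept last."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(X, w):
    """Apply the fitted weights (intercept last) to a feature matrix."""
    return np.column_stack([X, np.ones(len(X))]) @ w

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): a human hate speech score and two
# hypothetical attribute-level LLM predictions, one aligned, one inverted.
rng = np.random.default_rng(0)
n = 200
human_score = rng.normal(size=n)
aligned_attr = human_score + 0.5 * rng.normal(size=n)
inverted_attr = -human_score + 0.5 * rng.normal(size=n)
X = np.column_stack([aligned_attr, inverted_attr])

# The inverted attribute correlates negatively with the human score...
corr_inverted = np.corrcoef(inverted_attr, human_score)[0, 1]

# ...but the regression assigns it a negative weight and still benefits from it.
w = fit_linear(X, human_score)
r2 = r_squared(human_score, predict(X, w))

print("correlation of inverted attribute:", corr_inverted)
print("weights (aligned, inverted, intercept):", w)
print("in-sample R^2:", r2)
```

The R^2 here is computed in-sample on toy data and says nothing about the paper's reported 0.71; a real evaluation would fit on one split and score on held-out annotations.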

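Impact: Reveals a systematic inversion in LLM alignment on evaluative hate speech dimensions and proposes a reconstruction method that better matches human signals.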

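Ranking rationale: This cluster contains a research paper that analyzes in detail how closely LLMs agree with human judgments on subjective attributes related to hate speech.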


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser · 2026-05-27 04:00

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments acros…
arXiv cs.CL TIER_1 English(EN) · Alexander Fraser · 2026-05-26 13:44

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attribut…

