PulseAugur

LLM Hate Speech Alignment Inverted on Evaluative Dimensions

A new research paper examines how well large language models (LLMs) align with human judgments on hate speech, evaluating Llama 3.1 and Qwen 2.5. The study finds that the models align well with human ratings on explicit behavioral dimensions but correlate inversely with them on evaluative dimensions such as sentiment and hate speech. The researchers propose reconstructing hate speech scores from attribute-level predictions, a method that achieves an R^2 of up to 0.71 and outperforms direct prompting.
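One way to picture the attribute-based reconstruction is as a regression from per-attribute model predictions onto the human hate speech score. The sketch below is illustrative only: it assumes a plain least-squares linear fit, which the summary does not confirm as the paper's method, and it runs on synthetic data rather than the paper's annotations. All names (`fit_linear_reconstruction`, `r_squared`, the three made-up attributes) are hypothetical. It also shows why an inverted attribute can still be useful: the fit simply assigns it a negative weight.

```python
import numpy as np


def fit_linear_reconstruction(attrs, target):
    """Least-squares fit of target ~ attrs @ w + b; returns (w, b)."""
    # Append a column of ones so the last coefficient acts as the intercept.
    design = np.column_stack([attrs, np.ones(len(attrs))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[:-1], coef[-1]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (an assumption, not the paper's dataset).
rng = np.random.default_rng(0)
n = 500
human_score = rng.normal(size=n)

# Three hypothetical attribute-level predictions from a model:
# two track the human score, one is inverted (negatively correlated).
aligned_1 = human_score + rng.normal(scale=0.5, size=n)
aligned_2 = 0.5 * human_score + rng.normal(scale=0.5, size=n)
inverted = -human_score + rng.normal(scale=0.5, size=n)
attrs = np.column_stack([aligned_1, aligned_2, inverted])

# Fit on the first 400 items, evaluate on the held-out 100.
train, test = slice(0, 400), slice(400, n)
w, b = fit_linear_reconstruction(attrs[train], human_score[train])
reconstructed = attrs[test] @ w + b

inverted_corr = float(np.corrcoef(inverted, human_score)[0, 1])
r2 = float(r_squared(human_score[test], reconstructed))
print(f"correlation of inverted attribute with human score: {inverted_corr:.2f}")
print(f"weight on inverted attribute: {w[2]:.2f}")
print(f"held-out R^2 of reconstruction: {r2:.2f}")
```

The numbers this prints come from the synthetic setup and say nothing about the reported R^2 of 0.71; the point is only the shape of the computation.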

IMPACT Reveals a systematic inversion in LLM alignment with human judgments on evaluative hate speech dimensions, and suggests attribute-based reconstruction as a route to more human-aligned scores.

RANK_REASON The cluster contains a research paper detailing an analysis of LLM alignment with human judgments on subjective attributes related to hate speech.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser ·

    Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

    arXiv:2605.27025v1. Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments acros…

  2. arXiv cs.CL TIER_1 English(EN) · Alexander Fraser ·

    Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

    Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attribut…