A new research paper examines how well large language models (LLMs) align with human judgments of hate speech, evaluating Llama 3.1 and Qwen 2.5. The study found that the models align well with human ratings on explicit behavioral dimensions but correlate inversely with them on evaluative dimensions such as sentiment and hate speech. The researchers propose reconstructing hate speech scores from attribute-level predictions, an approach that achieves an R^2 of up to 0.71 and outperforms direct prompting.
AI IMPACT Reveals a systematic inversion in LLM alignment on evaluative hate speech dimensions, suggesting new methods for more human-aligned signal reconstruction.
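The idea of reconstructing a hate speech score from attribute-level predictions can be illustrated with a minimal sketch. The summary does not say how the paper combines the attributes, so the linear least-squares fit below is an assumption, as are the function names, the three attributes, and the synthetic data. It is not the authors' implementation and does not reproduce their results.

```python
import numpy as np

def fit_reconstruction(attr_preds, human_scores):
    """Fit least-squares weights (plus an intercept) mapping attributes to scores."""
    # Append a column of ones so the last weight acts as the intercept.
    X = np.column_stack([attr_preds, np.ones(len(attr_preds))])
    weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return weights

def reconstruct(attr_preds, weights):
    """Apply fitted weights to attribute predictions to get reconstructed scores."""
    X = np.column_stack([attr_preds, np.ones(len(attr_preds))])
    return X @ weights

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): 200 texts, 3 model-predicted attributes,
# and human scores generated as a noisy linear mix of those attributes.
rng = np.random.default_rng(0)
attr_preds = rng.normal(size=(200, 3))
true_mix = np.array([0.8, -0.5, 0.3])
human_scores = attr_preds @ true_mix + rng.normal(scale=0.1, size=200)

weights = fit_reconstruction(attr_preds, human_scores)
r2 = r_squared(human_scores, reconstruct(attr_preds, weights))
print(f"in-sample R^2 on synthetic data: {r2:.3f}")
```

The R^2 here is computed on the same data used for fitting, so it overstates how well the mapping would generalize; a held-out split would be needed for a figure comparable to a reported one.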
AI-generated summary · Google Gemini · from 2 sources.