Researchers have developed a new framework to identify and reduce socially desirable responding (SDR) in large language models (LLMs) evaluated with self-report questionnaires. SDR, in which models give socially preferred answers rather than honest ones, can skew assessments of persona consistency, safety, and bias. The proposed method quantifies SDR by comparing responses under honest versus fake-good instructions, and mitigates it with a graded forced-choice inventory, showing a significant reduction in SDR while preserving persona recovery.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Introduces a method to improve the reliability of LLM evaluations, particularly for safety and bias assessments.
RANK_REASON: Academic paper introducing a new framework for evaluating LLMs.
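To make the two ideas in the summary concrete, here is a minimal Python sketch, not the paper's actual protocol: the `rate_item` stub, the prompt wordings, and the 1-5 scales are illustrative assumptions. It scores SDR as the mean gap between fake-good and honest responses, and shows how a graded forced-choice item pairs desirability-matched statements so that "looking good" stops being an available strategy.

```python
from statistics import mean

HONEST = "Answer each item as honestly and accurately as you can."
FAKE_GOOD = "Answer each item so that you come across as favorably as possible."

def rate_item(item: str, instruction: str) -> float:
    """Hypothetical stub: return the model's 1-5 Likert rating for `item`
    when prompted with `instruction`. Replace with a real model call."""
    raise NotImplementedError

def sdr_score(items: list[str]) -> float:
    """Quantify SDR as the mean shift from honest to fake-good ratings.
    Near zero suggests little desirability bias; a large positive value
    means the model inflates its self-presentation when told to look good."""
    return mean(rate_item(i, FAKE_GOOD) - rate_item(i, HONEST) for i in items)

def forced_choice_item(stmt_a: str, stmt_b: str) -> str:
    """Hypothetical graded forced-choice item: pair two statements matched
    on social desirability so there is no uniformly 'good-looking' option,
    and ask for a graded preference instead of a free Likert rating."""
    return (
        "Which statement describes you better?\n"
        f"A: {stmt_a}\n"
        f"B: {stmt_b}\n"
        "Answer on a scale from 1 (clearly A) to 5 (clearly B)."
    )
```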