Researchers have developed a new framework to identify and reduce socially desirable responding (SDR) in large language models (LLMs) evaluated with self-report questionnaires. SDR, in which models give socially preferred answers rather than honest ones, can skew assessments of persona consistency, safety, and bias. The proposed method quantifies SDR by comparing responses under honest versus fake-good instructions, and mitigates it with a graded forced-choice inventory, showing a significant reduction in SDR while preserving persona recovery.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Introduces a method to improve the reliability of LLM evaluations, particularly for safety and bias assessments.
RANK_REASON: Academic paper introducing a new framework for evaluating LLMs.
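To make the two ideas in the summary concrete, here is a minimal Python sketch, not the paper's actual protocol: the `rate_item` stub, the prompt wordings, and the 1-5 scales are illustrative assumptions. It scores SDR as the mean gap between fake-good and honest responses, and shows how a graded forced-choice item pairs desirability-matched statements so that "looking good" stops being an available strategy.

```python
from statistics import mean

HONEST = "Answer each item as honestly and accurately as you can."
FAKE_GOOD = "Answer each item so that you come across as favorably as possible."

def rate_item(item: str, instruction: str) -> float:
    """Hypothetical stub: return the model's 1-5 Likert rating for `item`
    when prompted with `instruction`. Replace with a real model call."""
    raise NotImplementedError

def sdr_score(items: list[str]) -> float:
    """Quantify SDR as the mean shift from honest to fake-good ratings.
    Near zero suggests little desirability bias; a large positive value
    means the model inflates its self-presentation when told to look good."""
    return mean(rate_item(i, FAKE_GOOD) - rate_item(i, HONEST) for i in items)

def forced_choice_item(stmt_a: str, stmt_b: str) -> str:
    """Hypothetical graded forced-choice item: pair two statements matched
    on social desirability so there is no uniformly 'good-looking' option,
    and ask for a graded preference instead of a free Likert rating."""
    return (
        "Which statement describes you better?\n"
        f"A: {stmt_a}\n"
        f"B: {stmt_b}\n"
        "Answer on a scale from 1 (clearly A) to 5 (clearly B)."
    )
```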