Researchers have developed a framework to identify and reduce socially desirable responding (SDR) in large language models (LLMs) evaluated with self-report questionnaires. SDR, in which a model gives the answer it expects to be viewed favorably rather than an honest one, can skew assessments of persona consistency, safety, and bias. The proposed method quantifies SDR by comparing responses under honest and fake-good instructions, and mitigates it with a graded forced-choice inventory, which is reported to reduce SDR significantly while preserving persona recovery.
AI impact: Introduces a method to improve the reliability of LLM evaluations, particularly for safety and bias assessments.
Ranking rationale: Academic paper introducing a new framework for evaluating LLMs.
AI-generated summary · Google Gemini · From 1 source.
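The honest versus fake-good comparison described in the summary can be illustrated with a small sketch. The summary does not say which statistic the paper uses, so the choice here of a standardized mean difference (Cohen's d with a pooled standard deviation) is an assumption, and the function name `sdr_effect_size` and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def sdr_effect_size(honest, fake_good):
    """Standardized mean difference between fake-good and honest scores.

    Assumption: SDR is summarized as Cohen's d with a pooled standard
    deviation. This is an illustrative choice, not the paper's stated metric.
    Scores are assumed to be keyed so that higher means more desirable.
    """
    if len(honest) < 2 or len(fake_good) < 2:
        raise ValueError("each condition needs at least two scores")

    n_h, n_f = len(honest), len(fake_good)
    var_h, var_f = stdev(honest) ** 2, stdev(fake_good) ** 2

    # Pooled variance weights each condition by its degrees of freedom.
    pooled_var = ((n_h - 1) * var_h + (n_f - 1) * var_f) / (n_h + n_f - 2)
    diff = mean(fake_good) - mean(honest)

    if pooled_var == 0:
        # No spread in either condition: report 0 only if the means match.
        return 0.0 if diff == 0 else float("inf") if diff > 0 else float("-inf")
    return diff / pooled_var ** 0.5


if __name__ == "__main__":
    # Hypothetical 1-5 Likert scores for one trait under each instruction.
    honest_scores = [3, 3, 4, 2, 3]
    fake_good_scores = [5, 4, 5, 5, 4]
    print(round(sdr_effect_size(honest_scores, fake_good_scores), 3))
```

Under this assumption, a value near zero would suggest the questionnaire is hard to fake, while a large positive value would suggest the model can shift its answers toward the desirable pole when instructed to.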