PulseAugur
Human Psychometric Questionnaires Mischaracterize LLM Behavior

Study finds that LLM self-reports are inaccurate and fail to predict behavior

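Research indicates that traditional psychometric self-report questionnaires, such as the Big Five personality framework, do not reliably predict the behavior of large language models (LLMs). The research suggests that more specific, behavior-oriented frameworks, such as the Theory of Planned Behavior, can achieve coherence between self-reports and LLM responses comparable to human levels under certain conditions, such as a shared conversational context. In addition, an LLM-native psychometric instrument derived from behavioral affordances also failed to predict LLM behavior, which highlights potential confounds in LLM self-reports and the limitations of current evaluation methods.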

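Impact: Current psychometric methods for evaluating LLMs are inadequate; more robust, more behavior-specific evaluation tools are needed to ensure safe deployment.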

Ranking rationale: This cluster contains several academic papers published on arXiv and Hugging Face that discuss new research findings on LLM evaluation.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.AI TIER_1 English(EN) · Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez ·

    Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

    arXiv:2606.12730v1 Announce Type: new Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, …

  2. arXiv cs.AI TIER_1 English(EN) · Juan Manuel Contreras ·

    An LLM-Native Psychometric Instrument Fails to Predict LLM Behavior: Evidence from 25 Models

    arXiv:2606.09843v1 Announce Type: cross Abstract: Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deepe…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

    Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations.

  4. Hugging Face Daily Papers TIER_1 English(EN) ·

    Human Psychometric Questionnaires Mischaracterize LLM Behavior

    Human psychometric questionnaires fail to reliably predict LLM behavior in real-world interactions, while generation-based profiling offers superior accuracy for understanding model responses to everyday user queries.