Human Psychometric Questionnaires Mischaracterize LLM Behavior

Studies find that LLM self-reports are inaccurate and fail to predict behavior

By the PulseAugur editorial team · [4 sources] · 2026-05-29 00:00

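Research indicates that traditional psychometric self-report questionnaires, such as the Big Five personality framework, do not reliably predict the behavior of large language models (LLMs). It suggests that more specific, behavior-oriented frameworks, such as the Theory of Planned Behavior, can under certain conditions (for example, when the self-report and the behavior share a conversational context) reach coherence with LLM responses comparable to human levels. In addition, an LLM-native psychometric instrument derived from behavioral affordances also failed to predict LLM behavior, which highlights potential confounds in LLM self-reports and the limitations of current evaluation methods.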

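Impact: The psychometric methods currently used to evaluate LLMs are inadequate; more robust, behavior-specific evaluation tools are needed to ensure safe deployment.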

Ranking rationale: This cluster contains several academic papers published on arXiv and Hugging Face that discuss new research findings on LLM evaluation.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

arXiv cs.AI TIER_1 English(EN) · Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez · 2026-06-12 04:00

Rethinking Psychometric Evaluation of Large Language Models: When and Why Self-Reports Predict Behavior

arXiv:2606.12730v1 Announce Type: new Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, …
arXiv cs.AI TIER_1 English(EN) · Juan Manuel Contreras · 2026-06-10 04:00

An LLM-Native Psychometric Instrument Fails to Predict LLM Behavior: Evidence from 25 Models

arXiv:2606.09843v1 Announce Type: cross Abstract: Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deepe…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Rethinking Psychometric Evaluation of Large Language Models: When and Why Self-Reports Predict Behavior

Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations.
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Human Psychometric Questionnaires Mischaracterize LLM Behavior

Human psychometric questionnaires fail to reliably predict LLM behavior in real-world interactions, while generation-based profiling offers superior accuracy for understanding model responses to everyday user queries.
