PulseAugur
English(EN) Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

GPT-5 outperforms humans on non-conversational Theory of Mind tasks

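A new arXiv paper introduces NCP-ExploreToM, a framework for evaluating the non-conversational Theory of Mind (ToM) capabilities of large language models (LLMs). The study assesses how well models can induce specific belief states in others through action rather than dialogue. Across 600 task instances, GPT-5 performed strongly, succeeding on roughly 80% of tasks and outperforming human participants in this agentic setting, although humans remained more robust overall. The study also notes that all evaluated models, like humans, were better at inducing true beliefs than false beliefs, which points to potential for alignment work.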

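Impact: Highlights the emerging social reasoning capabilities of large language models and underscores the importance of agentic evaluations for safety and alignment.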

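Ranking rationale: This cluster contains one academic paper detailing a new evaluation framework and benchmark results for large language models.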

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.CL TIER_1 English(EN) · Ben Slater, Matteo G. Mecattaf, Lucy G. Cheke, John Burden, Winnie Street ·

    Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

    arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we…

  2. arXiv cs.CL TIER_1 English(EN) · Winnie Street ·

    Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

    Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we evaluate an agent's ability to induce specific …