PulseAugur

GPT-5 outperforms humans in non-conversational Theory of Mind tasks

A new arXiv paper introduces NCP-ExploreToM, a framework for evaluating the non-conversational Theory of Mind (ToM) capabilities of Large Language Models (LLMs). The research assesses how well models can induce specific belief states in others through actions rather than dialogue. Across 600 task instances, GPT-5 succeeded in approximately 80% of tasks and outperformed human participants in this agentic setting, though humans remained more robust overall. The study also noted that all evaluated models, like humans, were better at inducing true beliefs than false beliefs, suggesting potential for alignment efforts.

IMPACT Highlights emerging social-reasoning capabilities in LLMs and underscores the need for agentic evaluations for safety and alignment.

RANK_REASON The cluster contains an academic paper detailing a new evaluation framework and benchmark results for LLMs.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Ben Slater, Matteo G. Mecattaf, Lucy G. Cheke, John Burden, Winnie Street ·

    Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

    arXiv:2606.31916v1. Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we…

  2. arXiv cs.CL TIER_1 English(EN) · Winnie Street ·

    Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

    Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we evaluate an agent's ability to induce specific …