LLMs struggle to simulate real human behavior, new research shows

By PulseAugur Editorial · [2 sources] · 2026-05-19 21:23

Two new research papers explore the limitations of current large language models in simulating realistic human behavior. The first paper, "OmniBehavior," introduces a benchmark using real-world data and finds that LLMs tend to exhibit a positive, homogenized bias, failing to capture individual differences. The second paper, "DITTO," proposes a reinforcement learning approach that incorporates verbal feedback to improve LLM simulation capabilities, showing significant gains over base models and outperforming GPT-5.4 on several benchmarks.

IMPACT New benchmarks and RL techniques highlight LLM limitations in simulating diverse human behaviors, indicating a need for more nuanced training data and feedback mechanisms.

RANK_REASON Two academic papers published on arXiv introduce new benchmarks and methods for evaluating LLM simulation of human behavior.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin · 2026-05-22 04:00

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

arXiv:2604.08362v2 Announce Type: replace Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, fa…

arXiv cs.CL TIER_1 English(EN) · Maarten Sap · 2026-05-19 21:23

Reinforcing Human Behavior Simulation via Verbal Feedback

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and…
