Researchers have developed a novel benchmark that uses a multi-agent framework, inspired by the film '12 Angry Men', to evaluate deliberation in large language models. The study tested GPT-4o and Llama-4-Scout and found that most simulations ended in hung juries, indicating that the models tend to anchor on their initial positions rather than persuade one another. Llama-4-Scout showed greater deliberative flexibility and changed its vote more often than GPT-4o, suggesting that the intensity of RLHF alignment training, rather than raw capability, influences this flexibility.
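The paper's deliberation setup can be pictured as repeated voting rounds among juror agents until either unanimity or a round limit is reached. The sketch below is a minimal, hypothetical simulation of that loop: a random flip toward the majority stands in for the LLM persuasion step, and the function names, parameters, and flip model are illustrative assumptions, not the paper's actual implementation.

```python
import random

def simulate_jury(n_jurors=12, rounds=10, flip_prob=0.1, seed=0):
    """Toy jury deliberation: each juror holds a vote and may be 'persuaded'
    toward the current majority each round. Returns (verdict, vote_changes),
    where verdict is 'hung' if no unanimous vote emerges within the limit."""
    rng = random.Random(seed)
    votes = [rng.choice(["guilty", "not guilty"]) for _ in range(n_jurors)]
    vote_changes = 0
    for _ in range(rounds):
        for i in range(n_jurors):
            # Stand-in for persuasion: in the benchmark, an LLM agent would
            # read the deliberation transcript and decide whether to switch.
            if rng.random() < flip_prob:
                majority = max(set(votes), key=votes.count)
                if votes[i] != majority:
                    votes[i] = majority
                    vote_changes += 1
        if len(set(votes)) == 1:  # unanimous verdict reached
            return votes[0], vote_changes
    return "hung", vote_changes  # anchoring: no unanimity within the limit
```

Counting `vote_changes` across many seeded runs gives a simple flexibility metric of the kind the study uses to compare models.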
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This research suggests that current LLMs may struggle with nuanced deliberation and persuasion, potentially impacting their use in complex decision-making scenarios.
RANK_REASON This is a research paper published on arXiv detailing a new evaluation method for LLMs.