PulseAugur

AI agents struggle to deliberate like humans in jury simulation

Researchers have developed a benchmark that uses a multi-agent framework, inspired by the film '12 Angry Men', to evaluate how large language models deliberate. The study tested GPT-4o and Llama-4-Scout and found that most simulations ended in hung juries, indicating that the agents tend to anchor on their positions rather than persuade one another. Llama-4-Scout showed greater deliberative flexibility and more vote changes than GPT-4o, which suggests that the intensity of RLHF alignment training, rather than raw capability, influences this flexibility.

Impact: This research suggests that current LLMs may struggle with nuanced deliberation and persuasion, which could limit their use in complex decision-making scenarios.

Ranking rationale: This is a research paper published on arXiv detailing a new evaluation method for LLMs.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Ahmet Bahaddin Ersoz

    12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

    arXiv:2605.01986v1. Abstract: What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-age…