Researchers have developed a novel benchmark that uses a multi-agent framework, inspired by the film '12 Angry Men', to evaluate deliberation in large language models. The study tested GPT-4o and Llama-4-Scout and found that most simulations ended in hung juries, indicating that the models tend to anchor on their initial positions rather than persuade one another. Llama-4-Scout showed greater deliberative flexibility and changed its vote more often than GPT-4o, suggesting that the intensity of RLHF alignment training, rather than raw capability, influences this flexibility.
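The paper's deliberation setup can be pictured as repeated voting rounds among juror agents until either unanimity or a round limit is reached. The sketch below is a minimal, hypothetical simulation of that loop: a random flip toward the majority stands in for the LLM persuasion step, and the function names, parameters, and flip model are illustrative assumptions, not the paper's actual implementation.

```python
import random

def simulate_jury(n_jurors=12, rounds=10, flip_prob=0.1, seed=0):
    """Toy jury deliberation: each juror holds a vote and may be 'persuaded'
    toward the current majority each round. Returns (verdict, vote_changes),
    where verdict is 'hung' if no unanimous vote emerges within the limit."""
    rng = random.Random(seed)
    votes = [rng.choice(["guilty", "not guilty"]) for _ in range(n_jurors)]
    vote_changes = 0
    for _ in range(rounds):
        for i in range(n_jurors):
            # Stand-in for persuasion: in the benchmark, an LLM agent would
            # read the deliberation transcript and decide whether to switch.
            if rng.random() < flip_prob:
                majority = max(set(votes), key=votes.count)
                if votes[i] != majority:
                    votes[i] = majority
                    vote_changes += 1
        if len(set(votes)) == 1:  # unanimous verdict reached
            return votes[0], vote_changes
    return "hung", vote_changes  # anchoring: no unanimity within the limit
```

Counting `vote_changes` across many seeded runs gives a simple flexibility metric of the kind the study uses to compare models.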
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This research suggests that current LLMs may struggle with nuanced deliberation and persuasion, potentially impacting their use in complex decision-making scenarios.
RANK_REASON This is a research paper published on arXiv detailing a new evaluation method for LLMs.