Researchers have developed a benchmark that uses a multi-agent framework to evaluate how large language models deliberate, inspired by the film '12 Angry Men'. The study tested GPT-4o and Llama-4-Scout and found that most simulations ended in hung juries, indicating a tendency toward anchoring rather than persuasion. Llama-4-Scout showed greater deliberative flexibility and more vote changes than GPT-4o, suggesting that the intensity of RLHF alignment training, rather than raw capability, influences this flexibility.
AI impact: This research suggests that current LLMs may struggle with nuanced deliberation and persuasion, which could limit their use in complex decision-making scenarios.
Ranking rationale: This is a research paper published on arXiv detailing a new evaluation method for LLMs.
AI-generated summary · Google Gemini · from 1 source.