Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate
New Methods Evaluate Reasoning Quality in Multi-Agent LLM Systems
By the PulseAugur editorial team · [7 sources]
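Researchers have developed new methods for evaluating the reasoning quality of multi-agent debate systems, rather than checking only the final answer. One method uses token-level log-probabilities from the early stage of generation, or "confidence signals," to predict how good the reasoning is, even when no reference answer is available. Another study finds that multi-agent debate can create an illusion of consensus while hiding reasoning inconsistencies: agents appear to agree more on the surface even as their reasoning becomes less aligned.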
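As a rough illustration of the early-token confidence idea, the sketch below summarises the log-probabilities of the first few generated tokens of a response. It is a minimal sketch under stated assumptions, not the method from the cited papers: the function name `early_token_confidence`, the window size `k`, and the choice of summary statistics are all hypothetical, and the log-probability values in the usage lines are invented for demonstration.

```python
import math
from typing import Dict, Sequence


def early_token_confidence(token_logprobs: Sequence[float], k: int = 32) -> Dict[str, float]:
    """Summarise model confidence over the first k generated tokens.

    token_logprobs: natural-log probabilities of each sampled token, in
    generation order (assumed to come from whatever decoding API is in use).
    k: hypothetical window size; the source does not specify a value.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")

    window = list(token_logprobs[:k])
    mean_lp = sum(window) / len(window)
    return {
        "n_tokens": float(len(window)),
        "mean_logprob": mean_lp,           # higher (closer to 0) = more confident
        "min_logprob": min(window),        # the least confident early token
        "perplexity": math.exp(-mean_lp),  # lower = more confident
    }


# Invented log-probabilities for two hypothetical agent responses.
confident_agent = [-0.05, -0.10, -0.02, -0.20, -0.08]
hesitant_agent = [-1.20, -0.90, -2.10, -0.70, -1.50]

for name, lps in [("confident", confident_agent), ("hesitant", hesitant_agent)]:
    print(name, early_token_confidence(lps, k=4))
```

Such a score could be compared against an external quality rating to check whether early confidence actually tracks reasoning quality on a given task; that relationship is an empirical question and is not established by this sketch.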
AI
arXiv:2606.13197v1 Announce Type: new Abstract: Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional comp…
arXiv cs.CL
Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer
arXiv:2606.10307v1 Announce Type: new Abstract: Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can…
arXiv cs.AI
Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer
arXiv:2606.10296v1 Announce Type: cross Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between …
Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as…
arXiv cs.MA (Multiagent)
Christopher C. Yang
Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reaso…
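The gap between answer-level consensus and reasoning-level alignment can be illustrated with a toy comparison. The sketch below is not CARA and makes no claim about how that method works: `answer_agreement` and `reasoning_overlap` are hypothetical helpers, and word-set Jaccard overlap is only a crude lexical stand-in for a real measure of reasoning alignment. The example answers and rationales are invented.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence, Set


def answer_agreement(answers: Sequence[str]) -> float:
    """Fraction of agents that gave the most common final answer."""
    if not answers:
        raise ValueError("answers must be non-empty")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def _jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def reasoning_overlap(rationales: Sequence[str]) -> float:
    """Mean pairwise Jaccard overlap of the agents' rationale word sets.

    A crude lexical proxy, assumed here only for illustration.
    """
    if len(rationales) < 2:
        raise ValueError("need at least two rationales to compare")
    bags = [set(r.lower().split()) for r in rationales]
    pairs = list(combinations(bags, 2))
    return sum(_jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented example: three agents pick the same option for different reasons.
answers = ["B", "B", "B"]
rationales = [
    "elevated troponin indicates myocardial injury",
    "patient age makes option b the most common diagnosis",
    "other options were ruled out by the imaging report",
]

print("answer agreement:", answer_agreement(answers))
print("reasoning overlap:", reasoning_overlap(rationales))
```

Full agreement on the answer alongside low overlap in the rationales is the kind of pattern the abstract warns about: unanimity on the final choice says little about whether the agents reasoned in the same way.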