Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

New Methods Evaluate Reasoning Quality in Multi-Agent LLM Systems

By the PulseAugur editorial team · [7 sources] · 2026-06-07 05:14

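Researchers have developed new methods for evaluating the reasoning quality of multi-agent debate systems, rather than only checking the final answer. One approach uses token-level log-probabilities from the early stage of generation, described as "confidence signals," to predict whether reasoning is good or poor, even when no reference answer is available. Another study finds that multi-agent debate can create an illusion of consensus while hiding reasoning inconsistencies: agents appear to agree more on the surface even as their reasoning becomes less aligned.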
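To make the early-token confidence idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from any of the papers: the function names `early_token_confidence` and `rank_by_early_confidence`, the window size `k=32`, and the choice of a plain mean are all hypothetical. It assumes each agent's response comes with a list of per-token log-probabilities, as many decoding APIs can return.

```python
from statistics import mean


def early_token_confidence(token_logprobs, k=32):
    """Mean log-probability over the first k generated tokens.

    k is an assumed window size; the summarized work refers only to the
    "early stage" of generation.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return mean(token_logprobs[:k])


def rank_by_early_confidence(responses, k=32):
    """Sort agent names from most to least confident.

    responses maps an agent name to its list of per-token log-probabilities.
    """
    scores = {name: early_token_confidence(lp, k) for name, lp in responses.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical log-probabilities for two debating agents.
    debate = {
        "agent_a": [-0.1, -0.1, -0.4],
        "agent_b": [-1.0, -0.5, -0.2],
    }
    print(rank_by_early_confidence(debate, k=2))
```

A score like this needs no reference answer, which is the property the summary highlights. Whether it actually tracks reasoning quality is the empirical question the papers investigate.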

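Impact: These studies offer new ways to audit and improve the reliability of LLM reasoning, which is essential for safety-critical applications.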

Ranking rationale: Several arXiv papers introduce new research methods and findings on LLM reasoning and multi-agent systems.


AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

arXiv cs.AI TIER_1 English(EN) · Fuqiang Niu, Bowen Zhang · 2026-06-12 04:00

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

arXiv:2606.13197v1 Announce Type: new Abstract: Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD …
arXiv cs.AI TIER_1 English(EN) · Bowen Zhang · 2026-06-11 11:02

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional comp…
arXiv cs.CL TIER_1 English(EN) · Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer · 2026-06-10 04:00

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

arXiv:2606.10307v1 Announce Type: new Abstract: Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can…
arXiv cs.AI TIER_1 English(EN) · Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer · 2026-06-10 04:00

Confident Liars: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

arXiv:2606.10296v1 Announce Type: cross Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between …
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 01:52

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as…
arXiv cs.CL TIER_1 English(EN) · Mark Warschauer · 2026-06-09 01:52

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as…
arXiv cs.MA (Multiagent) TIER_1 English(EN) · Christopher C. Yang · 2026-06-07 05:14

The Illusion of Consensus: How Multi-Agent Debate Masks Reasoning Inconsistency

Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reaso…
