PulseAugur
Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

New methods evaluate the reasoning quality of multi-agent LLMs

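Researchers have developed new methods for evaluating the reasoning quality of multi-agent debate systems, rather than only checking the final answer. One approach uses token-level log-probabilities from the early stage of generation, an intrinsic "confidence signal", to predict how good the reasoning is, even when no reference answer is available. Another study finds that multi-agent debate can create an illusion of consensus while hiding reasoning inconsistencies: agents appear to agree more on the surface even as their reasoning becomes less aligned.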
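As a rough illustration of the early-token confidence idea, the sketch below averages the log-probabilities of the first k generated tokens and ranks agent responses by that score. It is a minimal sketch under stated assumptions, not the papers' method: the function names, the cutoff k, the use of a plain mean, and the sample values are all hypothetical.

```python
from __future__ import annotations

from statistics import mean
from typing import Sequence


def early_token_confidence(token_logprobs: Sequence[float], k: int = 32) -> float:
    """Mean log-probability of the first k generated tokens.

    Values closer to 0 mean the model was more confident early on.
    The cutoff k and the plain mean are assumptions made for illustration.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    head = list(token_logprobs[:k])
    if not head:
        raise ValueError("no token log-probabilities supplied")
    return mean(head)


def rank_by_early_confidence(
    responses: dict[str, Sequence[float]], k: int = 32
) -> list[tuple[str, float]]:
    """Sort agents from most to least confident over their first k tokens."""
    scores = {agent: early_token_confidence(lp, k) for agent, lp in responses.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-token log-probabilities for two agents (illustrative values only).
example = {
    "agent_a": [-0.1, -0.2, -0.1, -1.5],
    "agent_b": [-0.9, -1.1, -0.8, -0.1],
}
ranking = rank_by_early_confidence(example, k=3)
```

In practice the log-probabilities would come from whatever decoding API exposes them; a score like this needs no reference answer, which is the property the summary highlights.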

Impact: These studies offer new ways to audit and improve the reliability of LLM reasoning, which matters for safety-critical applications.

Ranking rationale: Multiple arXiv papers introduce new research methods and findings on LLM reasoning and multi-agent systems.


AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

  1. arXiv cs.AI · Fuqiang Niu, Bowen Zhang

    ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

    arXiv:2606.13197v1. Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD …

  2. arXiv cs.AI · Bowen Zhang

    ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

    Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional comp…

  3. arXiv cs.CL · Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

    Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

    arXiv:2606.10307v1. Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can…

  4. arXiv cs.AI · Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

    The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

    arXiv:2606.10296v1. Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between …

  5. Hugging Face Daily Papers

    Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

    Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as…

  6. arXiv cs.CL · Mark Warschauer

    Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

    Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as…

  7. arXiv cs.MA (Multiagent) · Christopher C. Yang

    The Illusion of Consensus: How Multi-Agent Debate Masks Reasoning Inconsistency

    Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reaso…