New benchmark quantifies how LLMs diverge in API selection across domains

By the PulseAugur editorial team · [1 source] · 2026-04-28 04:00

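Researchers have developed a new framework for measuring how much different large language models (LLMs) disagree when retrieving and ranking external APIs for a task. Across a range of API domains and major model families, the study finds moderate overall agreement, with significant variation by task type. Structured tasks show higher agreement, while open-ended reasoning tasks produce greater divergence, which highlights a potential safety risk in multi-agent LLM coordination.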
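The summary does not say which agreement metrics the paper uses. As a purely illustrative sketch, one common way to quantify how closely several models agree on a ranked list is the mean pairwise Jaccard overlap of their top-k choices. The model names, API names, and rankings below are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard_at_k(rank_a, rank_b, k):
    """Jaccard overlap of the top-k items of two ranked lists."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    union = top_a | top_b
    # Two empty lists are treated as being in full agreement.
    return len(top_a & top_b) / len(union) if union else 1.0


def mean_pairwise_agreement(rankings, k):
    """Average Jaccard@k over every unordered pair of models."""
    pairs = list(combinations(rankings, 2))  # pairs of model names
    if not pairs:
        raise ValueError("need rankings from at least two models")
    total = sum(jaccard_at_k(rankings[a], rankings[b], k) for a, b in pairs)
    return total / len(pairs)


# Hypothetical API rankings returned by three models for the same task.
rankings = {
    "model_a": ["weather.get", "geo.lookup", "maps.route"],
    "model_b": ["weather.get", "maps.route", "news.search"],
    "model_c": ["geo.lookup", "weather.get", "maps.route"],
}

print(round(mean_pairwise_agreement(rankings, k=2), 3))
```

Jaccard@k ignores order within the top k. A rank correlation such as Kendall's tau would be an alternative if the ordering itself matters.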

Impact: Reveals hidden disagreement in LLM coordination, a pre-deployment safety risk for multi-agent systems.

Ranking rationale: An academic paper introducing a new benchmark framework for LLM API retrieval and ranking.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.CL TIER_1 English(EN) · Eyhab Al-Masri · 2026-04-28 04:00

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

arXiv:2604.22760v1 Announce Type: cross Abstract: Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarkin…

