Researchers have developed a framework to measure how much different large language models (LLMs) disagree when they retrieve and rank external APIs for a task. Across several API domains and major model families, the study found moderate overall agreement, with significant variation by task type. Structured tasks showed more consistency, while open-ended reasoning tasks led to greater divergence, highlighting a potential safety risk in multi-agent LLM coordination.
Impact: Reveals hidden divergence in LLM coordination, which poses a pre-deployment safety risk for multi-agent systems.
Ranking rationale: Academic paper introducing a new benchmarking framework for LLM API retrieval and ranking.
- Cronbach's alpha
- Jaccard similarity
- Kendall's W
- LLM
- Rank-Biased Overlap
- Sentiment Analysis
- Speech-to-Text
- Weather
- Kendall's tau
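
Two of the metrics tagged above can be illustrated with a minimal sketch. It is not the paper's implementation: the API names are hypothetical, the rankings are assumed to be complete and tie-free, and the helper names (`jaccard`, `mean_pairwise_jaccard`, `kendalls_w`) are introduced here for illustration only. Jaccard similarity compares which APIs two models retrieved, while Kendall's W summarizes how closely several models agree on the order.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two retrieved-API collections (order ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(result_sets):
    """Average Jaccard similarity over every pair of models."""
    pairs = list(combinations(result_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def kendalls_w(rankings):
    """Kendall's W for m complete, tie-free rankings of the same n items.

    Each ranking lists the items best-first. W is 1.0 when all rankers
    agree exactly and 0.0 when the rank sums are all equal.
    """
    m = len(rankings)
    items = rankings[0]
    n = len(items)
    rank_sums = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            rank_sums[item] += position
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums.values())
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings from three models over the same four APIs.
model_a = ["api_w", "api_x", "api_y", "api_z"]
model_b = ["api_w", "api_y", "api_x", "api_z"]
model_c = ["api_x", "api_w", "api_y", "api_z"]

print(kendalls_w([model_a, model_b, model_c]))
print(mean_pairwise_jaccard([model_a[:2], model_b[:2], model_c[:2]]))
```

Jaccard only applies when the models return different candidate sets (here, each model's top two), whereas Kendall's W needs every model to rank the same items, so the two answer different questions about disagreement.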
AI-generated summary · Google Gemini · from 1 source.