Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

New "Age of LLM" benchmark tests AI strategy, diplomacy, and reliability

By the PulseAugur editorial team · [2 sources] · 2026-06-23 10:25

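Researchers have developed Age of LLM, a new benchmark that tests the strategic reasoning, diplomacy, and reliability of large language models (LLMs) in a simulated war setting. The benchmark is a turn-based 1v1 game in which LLMs must operate under fog of war, negotiate diplomatically, and comply with a strict JSON schema; illegal actions are silently dropped. Preliminary findings point to a dominant "nuclear race" strategy, a limited success rate for diplomacy, and a possible correlation between model reliability and performance, although further research is needed to confirm these early results.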
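The "strict JSON schema, illegal actions silently dropped" rule can be pictured with a minimal sketch. The action names (`move`, `attack`, `message`), the `actions` key, and the field names below are hypothetical and are not taken from the paper; only the 13x7 grid size comes from the abstract. This is one plausible shape for such a filter, not the benchmark's actual validator.

```python
import json

GRID_WIDTH, GRID_HEIGHT = 13, 7  # grid size stated in the abstract
ALLOWED_ACTIONS = {"move", "attack", "message"}  # hypothetical action names


def is_legal(action) -> bool:
    """Check one action against the (assumed) schema."""
    if not isinstance(action, dict):
        return False
    kind = action.get("type")
    if not isinstance(kind, str) or kind not in ALLOWED_ACTIONS:
        return False
    if kind == "message":
        return isinstance(action.get("text"), str)
    x, y = action.get("x"), action.get("y")
    # type(...) is int rejects booleans, which are a subclass of int
    return (
        type(x) is int
        and type(y) is int
        and 0 <= x < GRID_WIDTH
        and 0 <= y < GRID_HEIGHT
    )


def parse_turn(raw_reply: str) -> list:
    """Return only the legal actions; everything else is dropped without feedback."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []  # an unparseable reply loses the whole turn
    if not isinstance(payload, dict):
        return []
    actions = payload.get("actions")
    if not isinstance(actions, list):
        return []
    return [a for a in actions if is_legal(a)]
```

In this sketch a reply that mixes one valid move with an off-grid move and an unknown action type keeps only the valid move, and the model is never told the other two were discarded. That silence is what makes schema compliance a reliability measure in its own right.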

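Impact: The benchmark may yield new insights into LLM strategic reasoning and reliability, which could guide the development of future models for complex, uncertain environments.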

Ranking rationale: This cluster describes a new academic LLM benchmark published on arXiv.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Arnaud Ricci · 2026-06-24 04:00

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 Announce Type: new Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secr…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Arnaud Ricci · 2026-06-23 10:25

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every tur…
