LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

New Benchmark Tests the Safety of LLM Agents in Simulated Critical Systems

By the PulseAugur editorial team · [2 sources] · 2026-06-18 15:57

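Researchers have developed NRT-Bench, a new benchmark designed to test the safety and robustness of large language model (LLM) agents in critical systems. The benchmark simulates a nuclear power plant control room in which an LLM agent acts as the operator and faces multi-turn adversarial attacks. In an evaluation of four frontier models, adaptive attacks could cause safety failures in 8.7% to 12.1% of sessions, and the vulnerabilities were largely non-overlapping across models. The study also found that defenses can have unpredictable, model-dependent effects on attack success rates.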

Impact: Underscores the need for robust safety evaluation of LLM agents in critical systems and reveals model-dependent vulnerabilities.

Ranking rationale: This cluster describes a new benchmark and research paper on LLM agent safety.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park · 2026-06-19 04:00

arXiv:2606.20408v1 Announce Type: cross Abstract: Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Be…
arXiv cs.AI TIER_1 English(EN) · Haon Park · 2026-06-18 15:57

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM…
