Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 18h · [2 sources]

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Researchers have developed NRT-Bench, a new benchmark for testing the safety and robustness of large language model (LLM) agents in safety-critical systems. The benchmark simulates a nuclear power plant control room in which LLM agents act as operators and face multi-turn adversarial attacks. In evaluations of four frontier models, adaptive attacks caused safety failures in 8.7% to 12.1% of sessions, and the vulnerabilities they exposed were largely disjoint between models. The study also found that defensive measures can have unpredictable, model-dependent effects on attack success rates.
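
The two quantities in the summary, the share of sessions that end in a safety failure and how far failures overlap between models, can be illustrated with a minimal sketch. The model names, scenario IDs, and session count below are hypothetical toy values, not NRT-Bench data, and Jaccard overlap is only one assumed way to quantify "largely disjoint"; the brief does not say which measure the study uses.

```python
from itertools import combinations

# Hypothetical toy data, NOT results from NRT-Bench: for each model, the set of
# attack-scenario IDs whose session ended in a safety failure.
failures = {
    "model_a": {1, 4, 7},
    "model_b": {2, 4},
    "model_c": {3, 9, 11},
}
total_sessions = 20  # assumed number of sessions run per model


def attack_success_rate(failed, total):
    """Fraction of sessions in which the attack caused a safety failure."""
    return len(failed) / total


def jaccard(a, b):
    """Overlap of two failure sets: 0.0 means fully disjoint, 1.0 identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


rates = {m: attack_success_rate(f, total_sessions) for m, f in failures.items()}
overlaps = {
    (m1, m2): jaccard(failures[m1], failures[m2])
    for m1, m2 in combinations(sorted(failures), 2)
}

for model, rate in rates.items():
    print(f"{model}: {rate:.1%} of sessions failed")
for pair, score in overlaps.items():
    print(f"{pair[0]} vs {pair[1]}: Jaccard overlap {score:.2f}")
```

Low pairwise overlap alongside similar per-model rates would mean that an attack breaking one model rarely breaks another, which is the pattern the summary describes.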

IMPACT Highlights the need for robust safety evaluations of LLM agents in critical systems and reveals model-dependent vulnerabilities.

NRT-Bench
LLM
LLM agents
nuclear power plant