LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems
Researchers have developed NRT-Bench, a benchmark for testing the safety and robustness of large language model (LLM) agents in critical systems. The benchmark simulates a nuclear power plant control room in which LLM agents act as operators and face multi-turn adversarial attacks. In evaluations, adaptive attacks could cause safety failures in 8.7% to 12.1% of sessions across four frontier models, and the vulnerabilities exposed were largely disjoint between models. The study also found that defensive measures can have unpredictable, model-dependent effects on attack success rates.
AI IMPACT: Highlights the need for robust safety evaluations of LLM agents in critical systems and reveals model-dependent vulnerabilities.