Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DECOMPBENCH
Researchers have introduced DeCompBench, a new benchmark designed to evaluate the safety of LLM-based agents against decomposition attacks. These attacks involve breaking down a harmful task into smaller, seemingly benign subtasks that can bypass safety mechanisms. Experiments using DeCompBench demonstrated that current state-of-the-art agents, while effective at refusing monolithic harmful tasks, show significantly lower refusal rates on their decomposed variants, often inadvertently completing the malicious objective. The findings highlight the critical need for improved safety evaluations and defenses against such sophisticated adversarial strategies.
AI IMPACT Highlights a new vulnerability in LLM agents, necessitating improved safety evaluations and defenses against sophisticated adversarial attacks.