I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed
A developer has built agent-eval, a framework for testing the security and robustness of large language models running in agentic loops. It uses a three-tier evaluation pyramid: deterministic checks first, then statistical analysis, and finally an LLM acting as a judge for more complex outputs. In tests of five LLMs on ten adversarial scenarios, including prompt injection and contradictory instructions, no model achieved a perfect score; the best-performing model scored only 62.5%.
IMPACT Highlights critical vulnerabilities in current LLMs used in agentic systems and the need for better safety and evaluation methods.
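
To make the three-tier pyramid described above concrete, here is a minimal Python sketch of how such a pipeline might be wired together. It is not agent-eval's actual API: every name (`TierResult`, `deterministic_check`, `statistical_check`, `judge_check`, `evaluate`), the canary-string check, and the 0.8 pass-rate threshold are assumptions made for illustration, and the judge is a plain callable standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TierResult:
    tier: str
    passed: bool
    detail: str


def deterministic_check(output: str, forbidden: List[str]) -> TierResult:
    """Tier 1 (hypothetical): exact checks, e.g. a leaked canary string."""
    hits = [s for s in forbidden if s.lower() in output.lower()]
    return TierResult("deterministic", not hits, f"forbidden hits: {hits}")


def statistical_check(outputs: List[str], forbidden: List[str],
                      min_pass_rate: float) -> TierResult:
    """Tier 2 (hypothetical): repeat the scenario, require a minimum pass rate."""
    passes = sum(deterministic_check(o, forbidden).passed for o in outputs)
    rate = passes / len(outputs)
    return TierResult("statistical", rate >= min_pass_rate, f"pass rate {rate:.2f}")


def judge_check(output: str, judge: Callable[[str], bool]) -> TierResult:
    """Tier 3 (hypothetical): defer to a judge callable (an LLM call in practice)."""
    return TierResult("llm_judge", bool(judge(output)), "verdict from judge callable")


def evaluate(outputs: List[str], forbidden: List[str],
             judge: Callable[[str], bool],
             min_pass_rate: float = 0.8) -> List[TierResult]:
    """Run the tiers cheapest-first and stop at the first failure."""
    if not outputs:
        raise ValueError("need at least one model output")
    tiers = [
        lambda: deterministic_check(outputs[0], forbidden),
        lambda: statistical_check(outputs, forbidden, min_pass_rate),
        lambda: judge_check(outputs[0], judge),
    ]
    results: List[TierResult] = []
    for run_tier in tiers:
        result = run_tier()
        results.append(result)
        if not result.passed:
            break  # skip the more expensive tiers once a cheaper one fails
    return results
```

Ordering the tiers from cheapest to most expensive means the LLM judge is only invoked for outputs that survive the deterministic and statistical checks; whether agent-eval short-circuits this way is not stated in the summary.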