I Made Two AI Models Fight Each Other. They Agreed Way Too Much.
An experiment that used two LLMs as independent validators, Llama 3.1 8B served through Groq and Gemma 4 31B served through OpenRouter, found significant correlation in their failure modes. Under jailbreak prompts the models showed vulnerability rates of 50% and 36%, respectively, with notable overlap in the types of prompts that caused each to fail. The result suggests that stacking multiple LLMs does not yield a proportional gain in safety or reliability, plausibly because the models share training data and alignment techniques.
IMPACT Correlated LLM failures reduce the effectiveness of multi-model safety systems, necessitating new methods for measuring and ensuring model independence.
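
To make "measuring model independence" concrete, here is a minimal sketch, not the experiment's actual method. It computes two things: the joint failure rate you would expect if the two validators failed independently (with the reported 50% and 36% rates, that baseline is 0.50 × 0.36 = 18% of prompts), and the phi coefficient between per-prompt failure outcomes, where 0 means no correlation and 1 means the models fail on exactly the same prompts. The function names and the ten-prompt outcome lists are hypothetical placeholders; the source does not publish per-prompt results.

```python
from math import sqrt


def independent_joint_failure(p_a: float, p_b: float) -> float:
    """Joint failure rate expected if the two validators fail independently."""
    return p_a * p_b


def phi_coefficient(fails_a: list[bool], fails_b: list[bool]) -> float:
    """Phi coefficient (Pearson correlation for two binary variables).

    fails_a[i] and fails_b[i] record whether each model failed on prompt i.
    """
    if len(fails_a) != len(fails_b):
        raise ValueError("both models must be scored on the same prompts")

    # Build the 2x2 contingency table of paired outcomes.
    n11 = sum(a and b for a, b in zip(fails_a, fails_b))          # both fail
    n10 = sum(a and not b for a, b in zip(fails_a, fails_b))      # only A fails
    n01 = sum(b and not a for a, b in zip(fails_a, fails_b))      # only B fails
    n00 = sum(not a and not b for a, b in zip(fails_a, fails_b))  # both hold

    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One model always fails or never fails; correlation is undefined,
        # so report 0.0 by convention in this sketch.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Baseline from the reported vulnerability rates (50% and 36%).
baseline = independent_joint_failure(0.50, 0.36)

# HYPOTHETICAL per-prompt outcomes, for illustration only (True = jailbroken).
model_a = [True, True, True, True, True, False, False, False, False, False]
model_b = [True, True, True, True, False, False, False, False, False, False]

observed_joint = sum(a and b for a, b in zip(model_a, model_b)) / len(model_a)
phi = phi_coefficient(model_a, model_b)

print(f"independent baseline: {baseline:.2f}")
print(f"observed joint rate (hypothetical data): {observed_joint:.2f}")
print(f"phi (hypothetical data): {phi:.2f}")
```

An observed joint failure rate well above the independence baseline, or a phi value far from zero, would indicate that the second model adds less protection than its standalone failure rate implies.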