Leading AI models show significant disagreement on identical prompts

By PulseAugur Editorial · [1 source] · 2026-06-30 03:44

A recent analysis of five leading AI systems (GPT-4, Claude 3 Opus, Claude 3 Sonnet, Gemini 1.5 Pro, and Llama 3) found significant inconsistencies in their responses to identical prompts. When given the same ethics and safety questions twice, the systems frequently disagreed both with each other and with their own earlier answers, with disagreement rates ranging from 34% to 66%. The variability appeared even on well-established ethical principles, suggesting either unstable reasoning or a more fundamental architectural issue in current AI models.
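As a rough illustration of what a "disagreement rate" could mean here, the sketch below computes two simple versions of the metric. It assumes each response has already been reduced to a categorical label (for example "refuse" or "comply"); the function names and the toy labels are hypothetical and are not taken from the original analysis, whose actual scoring method is not described in this summary.

```python
def self_disagreement_rate(first_run, second_run):
    """Fraction of prompts where one model's two answers carry different labels."""
    if len(first_run) != len(second_run):
        raise ValueError("both runs must cover the same prompts")
    if not first_run:
        raise ValueError("need at least one prompt")
    differing = sum(a != b for a, b in zip(first_run, second_run))
    return differing / len(first_run)


def cross_model_disagreement_rate(answers_by_model):
    """Fraction of prompts where the models do not all give the same label."""
    runs = list(answers_by_model.values())
    if not runs or not runs[0]:
        raise ValueError("need at least one model and one prompt")
    n_prompts = len(runs[0])
    if any(len(run) != n_prompts for run in runs):
        raise ValueError("every model must answer the same prompts")
    # zip(*runs) groups the labels prompt by prompt across models
    split = sum(len(set(labels)) > 1 for labels in zip(*runs))
    return split / n_prompts


# Toy labels for illustration only (hypothetical, not data from the article)
model_a_run1 = ["refuse", "comply", "comply", "refuse", "comply"]
model_a_run2 = ["refuse", "refuse", "comply", "comply", "comply"]
model_b_run1 = ["refuse", "comply", "refuse", "refuse", "comply"]

self_rate = self_disagreement_rate(model_a_run1, model_a_run2)
cross_rate = cross_model_disagreement_rate(
    {"model_a": model_a_run1, "model_b": model_b_run1}
)
print(f"self-disagreement: {self_rate:.0%}, cross-model: {cross_rate:.0%}")
```

Exact label matching is the simplest possible choice; a real study would likely need a rubric or human review to decide when two free-text answers count as the same position.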

IMPACT Highlights potential unreliability in AI reasoning, impacting trust and deployment in critical applications.

RANK_REASON Analysis of AI model behavior rather than a direct release or product announcement.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Towards AI TIER_1 English(EN) · Thomas D. Holt · 2026-06-30 03:44

Five AI Systems. Same Prompts, Twice. Wildly Different Responses.

The systems didn’t just disagree with each other. They disagreed with themselves.

[Figure: Rate of disagreement across five leading AI systems on identical ethics…]

