GPT-4o, Claude 3.5 Sonnet accuracy gap narrows in real-world coding test

By PulseAugur Editorial · [1 source] · 2026-06-01 18:04

A recent evaluation of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on the HumanEval benchmark revealed a smaller accuracy gap than reported in official model cards. When tested with identical zero-shot prompts on the benchmark's 164 Python problems, GPT-4o achieved 86.1% accuracy, Claude 3.5 Sonnet reached 90.1%, and Gemini 1.5 Pro scored 84.1%. The analysis suggests that the failure modes of these models provide more insight into their real-world coding capabilities than the topline pass@1 metrics.
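
For readers unfamiliar with the metric, the sketch below shows one common way pass@1 is computed and how the percentage-point gaps between models fall out of the quoted scores. It is an illustration only, not the source author's evaluation harness: the function names (`pass_at_k`, `benchmark_pass_at_1`, `gap`) are hypothetical, the estimator is the standard unbiased pass@k formula, and the model-card figures are the ones quoted in the source excerpt further down.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them pass."""
    if n - c < k:
        # Fewer than k failing samples, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[bool]) -> float:
    """Mean pass@1 over problems when each problem gets exactly one sample."""
    return sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)


# Scores quoted in this summary and its source excerpt (percent).
# These are copied from the text, not re-measured.
model_card = {"GPT-4o": 90.2, "Claude 3.5 Sonnet": 92.0, "Gemini 1.5 Pro": 84.1}
retested = {"GPT-4o": 86.1, "Claude 3.5 Sonnet": 90.1, "Gemini 1.5 Pro": 84.1}


def gap(scores: dict[str, float], a: str, b: str) -> float:
    """Difference between two models' scores, in percentage points."""
    return round(scores[a] - scores[b], 1)


if __name__ == "__main__":
    # Toy run: 3 of 4 problems solved on the first attempt.
    print(benchmark_pass_at_1([True, True, False, True]))
    for a, b in [("Claude 3.5 Sonnet", "GPT-4o"), ("Claude 3.5 Sonnet", "Gemini 1.5 Pro")]:
        print(a, "vs", b, "| model card:", gap(model_card, a, b), "| retest:", gap(retested, a, b))
```

With a single zero-shot sample per problem, pass@1 reduces to the fraction of the 164 problems whose generated solution passes every unit test, so each problem moves the score by roughly 0.6 percentage points.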

IMPACT Real-world coding performance differences between leading models are smaller than reported, suggesting nuanced evaluation is needed.

RANK_REASON The cluster analyzes benchmark results for existing models, not a new release.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

dev.to — LLM tag TIER_1 (ET) · TildAlice · 2026-06-01 18:04

GPT-4o vs Claude 3.5 Sonnet: HumanEval Pass@1 Gap

The 12% Accuracy Gap Nobody Talks About

GPT-4o scores 90.2% on HumanEval pass@1. Claude 3.5 Sonnet hits 92.0%. Gemini 1.5 Pro lands at 84.1%. That's the headline from the model cards, but here's what actually happens when you run the same…
