A recent evaluation of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on the HumanEval benchmark found a smaller accuracy gap than the official model cards report. Tested with identical zero-shot prompts on 164 Python problems, GPT-4o achieved 86.1% accuracy, Claude 3.5 Sonnet reached 90.1%, and Gemini 1.5 Pro scored 84.1%. The analysis suggests that the models' failure modes reveal more about their real-world coding capabilities than the topline pass@1 metrics do.
AI IMPACT: Differences in coding performance between leading models are smaller than reported, suggesting that more nuanced evaluation is needed.
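For readers unfamiliar with the pass@1 figures cited above, here is a minimal sketch of how such a score is commonly computed, using the unbiased pass@k estimator introduced alongside HumanEval. The summary does not say how the evaluation computed its numbers, so treating it this way is an assumption; the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, and the sample counts in the usage lines are made up for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled for the problem
    c: completions that passed all unit tests
    k: sample budget being scored (k=1 gives pass@1)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    # 1 minus the probability that all k draws are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative data only (NOT from the evaluation): three problems,
# five samples each, with 5, 2, and 0 passing completions.
toy_results = [(5, 5), (5, 2), (5, 0)]
print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.3f}")
```

With one sample per problem, pass@1 reduces to the plain fraction of problems solved, which is why a single topline number says nothing about how the unsolved problems failed.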
AI-generated summary · Google Gemini · from 1 source.