A comparison tested five AI models (Opus, Grok, Sonnet, GPT-5.5, and Gemini 3.1 Pro) on reviewing uncommitted code changes in a React application. The application contained 15 intentionally planted bugs, ranging from simple syntax errors to complex logical flaws. Opus gave the most comprehensive review, identifying the most issues and even checking arithmetic by hand. Grok and Sonnet performed strongly: Grok caught a particularly difficult bug involving account balance calculations, while Sonnet was adept at date-related and React-specific issues. GPT-5.5 also identified the complex balance bug and several other logical errors, while Gemini 3.1 Pro had the lowest detection rate.
IMPACT Provides insights into the current capabilities of leading LLMs for software development tasks like code review.
RANK_REASON Comparison of multiple AI models on a specific task (code review) with quantitative results.
AI-generated summary · Google Gemini · from 1 source.