A recent study evaluated three large language models (GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2) on 993 Scrum certification-style questions. Gemini 3 Flash achieved the highest accuracy, and all three models showed low intra-model variability. Performance varied by question format and topic: the models did best on normatively explicit areas and single-answer multiple-choice questions, and struggled with multi-select and True/False formats and with more interpretive Scrum topics. The analysis found systematic errors, including overgeneralization and conflicts between common interpretations and strict Scrum definitions.
AI IMPACT: LLM performance on domain-specific certification questions varies by question format and topic, so reliable use in professional training requires careful prompting and evaluation.
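As a rough illustration of the two measurements the summary mentions (accuracy broken down by question format, and how much a model's score moves between repeated runs), here is a minimal Python sketch. The record layout, the function names, and the sample rows are all hypothetical placeholders, not data or methods from the study, and using the standard deviation of per-run accuracy is only one assumed way to express intra-model variability.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical graded answers: (question_format, run_index, is_correct).
# These rows are invented for illustration; they are NOT results from the study.
records = [
    ("single_choice", 0, True),
    ("single_choice", 1, True),
    ("multi_select", 0, False),
    ("multi_select", 1, True),
    ("true_false", 0, True),
    ("true_false", 1, False),
]


def accuracy_by_format(rows):
    """Return the fraction of correct answers for each question format."""
    buckets = defaultdict(list)
    for fmt, _run, ok in rows:
        buckets[fmt].append(1.0 if ok else 0.0)
    return {fmt: mean(scores) for fmt, scores in buckets.items()}


def run_to_run_spread(rows):
    """Return the population standard deviation of per-run accuracy.

    A small value means the model scores about the same on every run,
    which is one (assumed) reading of "low intra-model variability".
    """
    per_run = defaultdict(list)
    for _fmt, run, ok in rows:
        per_run[run].append(1.0 if ok else 0.0)
    return pstdev([mean(scores) for scores in per_run.values()])


if __name__ == "__main__":
    print(accuracy_by_format(records))
    print(run_to_run_spread(records))
```

Splitting scores by format before averaging is what makes a gap between single-answer and multi-select questions visible; a single overall accuracy figure would hide it.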
AI-generated summary · Google Gemini · from 2 sources.