Researchers have developed a new benchmark dataset built from data structures exam questions at Tel Aviv University to evaluate the performance of large language models. The study assessed models including OpenAI's GPT-4o, Anthropic's Claude 3.5, Mathstral 7B, and LLaMA 3 8B on their ability to answer closed-book and multiple-choice questions. The findings offer insight into the current capabilities of LLMs in computer science education.
Summary written by gemini-2.5-flash-lite from 1 source.
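The summary does not describe the paper's actual evaluation harness. As a rough illustration of the multiple-choice portion of such a benchmark, the sketch below scores a model by exact-match of its answer letter against a key. Everything here is hypothetical: the `QUESTIONS` items, the `evaluate` helper, and the stand-in model are illustrative, not drawn from the Tel Aviv University dataset or the paper.

```python
from typing import Callable

# Hypothetical exam items: a prompt, lettered choices, and a gold answer.
# These are NOT from the paper's dataset; they are placeholders.
QUESTIONS = [
    {
        "prompt": "Worst-case time to search a balanced BST with n nodes?",
        "choices": {"A": "O(1)", "B": "O(log n)", "C": "O(n)", "D": "O(n log n)"},
        "answer": "B",
    },
    {
        "prompt": "Which structure gives O(1) push and pop at one end?",
        "choices": {"A": "Singly linked list (head ops)", "B": "Binary heap",
                    "C": "AVL tree", "D": "B-tree"},
        "answer": "A",
    },
]


def format_question(q: dict) -> str:
    """Render a question and its lettered choices as one prompt string."""
    lines = [q["prompt"]] + [f"{letter}. {text}" for letter, text in q["choices"].items()]
    return "\n".join(lines)


def evaluate(model: Callable[[str], str]) -> float:
    """Score a model (prompt -> answer letter) by exact match against the key."""
    correct = sum(
        model(format_question(q)).strip().upper()[:1] == q["answer"]
        for q in QUESTIONS
    )
    return correct / len(QUESTIONS)


if __name__ == "__main__":
    # Stand-in "model" that always answers B; a real run would call an
    # LLM API (e.g., GPT-4o or Claude) at this point instead.
    accuracy = evaluate(lambda prompt: "B")
    print(f"accuracy = {accuracy:.2f}")
```

Exact-match letter grading like this is the simplest scoring rule for multiple-choice items; closed-book free-response questions would need a different grader (rubric-based or human), which this sketch does not attempt.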
IMPACT Provides a new evaluation dataset for LLMs in computer science education, highlighting current performance limitations.
RANK_REASON This is a research paper presenting a new benchmark dataset and evaluation of existing LLMs.