Researchers have developed a new benchmark dataset built from data structures exam questions at Tel Aviv University to evaluate the performance of large language models. The study assessed models including OpenAI's GPT-4o, Anthropic's Claude 3.5, Mathstral 7B, and LLaMA 3 8B on their ability to answer closed-book and multiple-choice questions. The findings offer insight into the current capabilities of LLMs in computer science education.
Summary written by gemini-2.5-flash-lite from 1 source.
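The summary does not describe the paper's actual evaluation harness. As a rough illustration of the multiple-choice portion of such a benchmark, the sketch below scores a model by exact-match of its answer letter against a key. Everything here is hypothetical: the `QUESTIONS` items, the `evaluate` helper, and the stand-in model are illustrative, not drawn from the Tel Aviv University dataset or the paper.

```python
from typing import Callable

# Hypothetical exam items: a prompt, lettered choices, and a gold answer.
# These are NOT from the paper's dataset; they are placeholders.
QUESTIONS = [
    {
        "prompt": "Worst-case time to search a balanced BST with n nodes?",
        "choices": {"A": "O(1)", "B": "O(log n)", "C": "O(n)", "D": "O(n log n)"},
        "answer": "B",
    },
    {
        "prompt": "Which structure gives O(1) push and pop at one end?",
        "choices": {"A": "Singly linked list (head ops)", "B": "Binary heap",
                    "C": "AVL tree", "D": "B-tree"},
        "answer": "A",
    },
]


def format_question(q: dict) -> str:
    """Render a question and its lettered choices as one prompt string."""
    lines = [q["prompt"]] + [f"{letter}. {text}" for letter, text in q["choices"].items()]
    return "\n".join(lines)


def evaluate(model: Callable[[str], str]) -> float:
    """Score a model (prompt -> answer letter) by exact match against the key."""
    correct = sum(
        model(format_question(q)).strip().upper()[:1] == q["answer"]
        for q in QUESTIONS
    )
    return correct / len(QUESTIONS)


if __name__ == "__main__":
    # Stand-in "model" that always answers B; a real run would call an
    # LLM API (e.g., GPT-4o or Claude) at this point instead.
    accuracy = evaluate(lambda prompt: "B")
    print(f"accuracy = {accuracy:.2f}")
```

Exact-match letter grading like this is the simplest scoring rule for multiple-choice items; closed-book free-response questions would need a different grader (rubric-based or human), which this sketch does not attempt.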
IMPACT Provides a new evaluation dataset for LLMs in computer science education, highlighting current performance limitations.
RANK_REASON This is a research paper presenting a new benchmark dataset and evaluation of existing LLMs.