PulseAugur

LLMs like GPT-4o and Claude 3.5 tested on university CS data structure exams

Researchers have built a new benchmark dataset from data structures exam questions at Tel Aviv University to evaluate large language models. The study assessed OpenAI's GPT-4o, Anthropic's Claude 3.5, Mathstral 7B, and LLaMA 3 8B on closed-book and multiple-choice questions. The findings offer insight into what current LLMs can do in computer science education.

Impact: Provides a new evaluation dataset for LLMs in computer science education and highlights current performance limitations.

Ranking rationale: A research paper presenting a new benchmark dataset and an evaluation of existing LLMs.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.CL · English (EN) · Edan Gabay, Yael Maoz, Jonathan Stahl, Naama Maoz, Abdo Amer, Orr Eilat, Hanoch Levy, Michal Kleinbort, Amir Rubinstein, Adi Haviv

    Evaluating Large Language Models on Computer Science University Exams in Data Structures

    arXiv:2604.23347v1. Abstract: We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU…