Researchers have developed a new benchmark dataset using data structures exam questions from Tel Aviv University to evaluate the performance of large language models. The study assessed models including OpenAI's GPT-4o, Anthropic's Claude 3.5, Mathstral 7B, and LLaMA 3 8B on their ability to answer closed-book and multiple-choice questions. The findings offer insights into the current capabilities of LLMs in the domain of computer science education.
Impact: Provides a new evaluation dataset for LLMs in computer science education, highlighting current performance limitations.
Ranking rationale: This is a research paper presenting a new benchmark dataset and evaluation of existing LLMs.
AI-generated summary · Google Gemini · from 1 source.