Researchers have introduced ClassEval-Pro, a benchmark for evaluating how well large language models generate code at the class level. It comprises 300 tasks across 11 domains, built by an automated pipeline that combines complexity enhancement with real-world code drawn from GitHub repositories updated after January 2025. In initial evaluations of five frontier LLMs, the best-performing model reached only 45.6% Pass@1, which points to significant difficulty with compositional code generation; logic and dependency errors were the primary failure types.
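The summary does not say how Pass@1 was computed for ClassEval-Pro. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation: for each task, n samples are generated, c of them pass all tests, and pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task.

    n: number of samples generated for the task
    c: number of those samples that pass every test
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: any draw of k must include a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimates over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three tasks, five samples each (illustration only).
    toy_results = [(5, 5), (5, 2), (5, 0)]
    print(f"Pass@1 = {benchmark_pass_at_k(toy_results, k=1):.3f}")
```

With k = 1 the estimator reduces to c / n per task, so a benchmark-level Pass@1 is simply the mean fraction of passing samples; with one sample per task it is the share of tasks solved on the first attempt.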
Impact: The new benchmark exposes limitations in LLM class-level code generation, with logic and dependency errors as the main failure types.
Ranking rationale: Introduces a new benchmark for evaluating LLM code generation capabilities.
Source: Hugging Face Daily Papers
AI-generated summary · Google Gemini · based on 1 source.