ClassEval-Pro benchmark reveals LLMs struggle with class-level code generation

By PulseAugur Editorial · [1 source] · 2026-04-29 17:38

Researchers have introduced ClassEval-Pro, a benchmark for evaluating how well large language models generate code at the class level. It comprises 300 tasks across 11 domains, built with an automated pipeline that incorporates complexity enhancement and real-world code from GitHub repositories updated after January 2025. In initial evaluations of five frontier LLMs, the best-performing model reached only 45.6% Pass@1, which points to significant difficulty with compositional code creation; logic and dependency errors were the primary failure types.
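For readers unfamiliar with the metric, Pass@1 is the share of tasks for which a single generated solution passes all of the task's tests. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass. Whether ClassEval-Pro computes its score exactly this way is an assumption; the function names and the toy data are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that passed every test
    k: sample budget being scored (k=1 gives Pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, any draw of k must contain a pass.
    if n - c < k:
        return 1.0
    # Probability that k samples drawn without replacement all fail,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    results: one (n, c) pair per task (hypothetical input format).
    """
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy data, not benchmark results: four tasks, one sample each, two passes.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(f"Pass@1 = {benchmark_pass_at_k(toy_results, k=1):.1%}")
```

With k=1 the estimator reduces to c/n per task, so a reported 45.6% Pass@1 means that, on average, fewer than half of the tasks were solved by a single attempt.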

IMPACT New benchmark highlights limitations in LLM class-level code generation, focusing on logic and dependency errors.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-29 17:38

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- re…
