[AINews] FrontierCode: Benchmarking for Code Quality over Slop
A new benchmark called UOJ-Bench evaluates Large Language Models (LLMs) on code generation, hacking, and repair tasks, moving beyond simple problem-solving. Initial tests show that even top-tier models struggle to identify errors in human-written code, with success rates below 50% in one-shot evaluations. Test-time scaling improves performance significantly, but its substantial computational cost limits practical deployment. Even so, the best models identify errors in a small percentage of full-score submissions, which suggests that LLMs could offer insights complementary to existing judging systems.
AI IMPACT: New benchmarks like UOJ-Bench and FrontierCode push LLM evaluation beyond simple problem-solving toward more nuanced capabilities such as code repair and maintainability, and they highlight current limitations.