
SciCode enhances HumanEval benchmark with STEM PhD upgrade

A new benchmark, SciCode, has been developed to evaluate AI models on complex STEM reasoning tasks, building on the existing HumanEval benchmark. It aims to provide a more rigorous assessment of AI capabilities in scientific and mathematical domains, marking a push toward AI testing that goes beyond general coding proficiency.

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: development of a new benchmark for AI evaluation.


Coverage (1 source)

  1. Smol AINews, Tier 1 (CA)

    SciCode: HumanEval gets a STEM PhD upgrade

    **PhD-level benchmarks** highlight how difficult coding scientific problems remains for LLMs, with **GPT-4** and **Claude 3.5 Sonnet** scoring under 5% on the new **SciCode** benchmark. Separately, **Anthropic** doubled the maximum output token limit for Claude 3.5 Sonnet to 8192 tokens. The **Q-GaL…