
SciCode enhances HumanEval benchmark with STEM PhD upgrade

A new benchmark, SciCode, has been developed to evaluate AI models on complex STEM reasoning tasks, building on the existing HumanEval benchmark. It aims to provide a more rigorous assessment of AI capabilities in scientific and mathematical domains, marking a push toward AI testing that goes beyond general coding proficiency.

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: development of a new benchmark for AI evaluation.


Coverage (1 source)

  1. Smol AINews, Tier 1 (CA)

    SciCode: HumanEval gets a STEM PhD upgrade

    **PhD-level benchmarks** highlight how difficult coding scientific problems remains for LLMs, with **GPT-4** and **Claude 3.5 Sonnet** scoring under 5% on the new **SciCode** benchmark. Separately, **Anthropic** doubled the maximum output token limit for Claude 3.5 Sonnet to 8192 tokens. The **Q-GaL…