PulseAugur

METR evaluates AI autonomy: agents built on GPT-4o and Claude complete a fraction of tasks at a level comparable to human baselines.

METR has developed a new suite of approximately 50 automatically scored tasks designed to measure general autonomous capabilities in AI systems. The tasks span skills including cybersecurity, software engineering, and machine learning, and preliminary evaluations show that agents built on models like GPT-4o and Claude can complete a fraction of these tasks comparable to human performance within a short timeframe. The evaluation suite aims to provide a continuous measure of AI autonomy, serving as a proxy for an AI's potential global impact and aiding in capability forecasting.
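The core mechanic the summary describes is a harness of automatically scored tasks whose headline metric is the fraction an agent completes. The sketch below is a hypothetical illustration of that structure, not METR's actual code: the `Task` dataclass, the toy tasks, and `toy_agent` are all invented stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One automatically scored task: instructions plus a scoring function."""
    name: str
    instructions: str
    score: Callable[[str], float]  # maps the agent's final answer to [0, 1]

# Toy tasks standing in for METR-style cybersecurity / SWE / ML tasks.
tasks = [
    Task("reverse_string", "Reverse the string 'metr'.",
         lambda answer: 1.0 if answer == "rtem" else 0.0),
    Task("sum_digits", "Sum the digits of 2024.",
         lambda answer: 1.0 if answer == "8" else 0.0),
]

def toy_agent(instructions: str) -> str:
    """Stand-in for a model-based agent; a real suite would call an LLM here."""
    if "Reverse" in instructions:
        return "rtem"
    return "7"  # answers the second task incorrectly

# The suite's headline metric: fraction of tasks the agent completes,
# which can then be compared against a human-baseliner fraction.
scores = [t.score(toy_agent(t.instructions)) for t in tasks]
fraction_completed = sum(scores) / len(scores)
print(fraction_completed)  # 0.5
```

Because each task scores itself, the same loop can be rerun continuously against new models, which is what makes the suite usable as an ongoing measure rather than a one-off benchmark.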

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: The cluster describes the development and preliminary results of a new AI evaluation task suite from METR, an academic/research-oriented organization.

Read on METR (Model Evaluation & Threat Research) →


COVERAGE [1]

  1. METR (Model Evaluation & Threat Research) · Tier 1

    An update on our general capability evaluations

    Since releasing our autonomy evaluation resources (https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/), METR has primarily focused on developing better measures of AI capabilities. We are currently developing evaluations for AI R&D capabilities tha…