PulseAugur

METR research finds AI agents struggle with code quality beyond automated tests

A new research update from METR suggests that current AI benchmarks may overestimate the real-world performance of AI agents. The study found that while agents often produce functionally correct code, that code frequently suffers from poor test coverage, formatting and linting errors, or low overall quality, making it difficult to use as-is. This discrepancy between the algorithmic scoring used in benchmarks and manual review highlights a gap in evaluating the true utility of AI systems, especially on tasks whose quality is not easily quantifiable.
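To make the distinction concrete, here is a minimal, hypothetical sketch of how an "algorithmic" score (tests pass or fail) can diverge from a "holistic" check that also gates on lint and coverage. This is not METR's actual harness; it assumes pytest, pytest-cov, and ruff are installed, and the submission path and 80% coverage threshold are made up for illustration.

```python
# Toy contrast between algorithmic scoring (automated tests pass/fail)
# and a holistic check that also looks at lint/format and coverage.
# Hypothetical sketch only; not METR's evaluation code.
import subprocess


def algorithmic_score(repo_dir: str) -> float:
    """Score 1.0 if the repo's test suite passes, else 0.0."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"], cwd=repo_dir, capture_output=True
    )
    return 1.0 if result.returncode == 0 else 0.0


def holistic_flags(repo_dir: str) -> list[str]:
    """Collect quality issues that pass/fail test scoring ignores."""
    flags = []
    # Formatting/linting: ruff is one common linter; any linter would do.
    lint = subprocess.run(["ruff", "check", "."], cwd=repo_dir, capture_output=True)
    if lint.returncode != 0:
        flags.append("lint/format issues")
    # Coverage gate (requires the pytest-cov plugin); 80% is an arbitrary bar.
    cov = subprocess.run(
        ["python", "-m", "pytest", "-q", "--cov=.", "--cov-fail-under=80"],
        cwd=repo_dir,
        capture_output=True,
    )
    if cov.returncode != 0:
        flags.append("insufficient test coverage")
    return flags


if __name__ == "__main__":
    repo = "agent_submission/"  # hypothetical path to an agent's patched repo
    score = algorithmic_score(repo)
    flags = holistic_flags(repo)
    # The gap the summary describes: score can be 1.0 while flags is non-empty,
    # i.e. the code "passes" the benchmark but is not usable as-is.
    print(f"algorithmic score: {score}, holistic flags: {flags}")
```

The point of the sketch is only that the two signals can disagree: a patch that makes every hidden test pass still earns flags from the second function, which is the kind of gap a benchmark's automated score would not surface.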

Summary written by gemini-2.5-flash-lite from 1 source.

Rank reason: The cluster is based on a research update and analysis of AI agent evaluation methods.

Read on METR (Model Evaluation & Threat Research) →


COVERAGE [1]

  1. METR (Model Evaluation & Threat Research) · Tier 1

    Research Update: Algorithmic vs. Holistic Evaluation

    TL;DR

    - On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.
    - …