
METR: Specialized coding scaffolds don't beat general ones for LLM time horizon tests

A recent evaluation by METR (Model Evaluation & Threat Research) found that specialized coding scaffolds such as Claude Code and Codex do not significantly outperform METR's general-purpose scaffolds (Triframe and ReAct) when measuring the time horizon capabilities of models like Opus 4.5 and GPT-5. Despite being optimized for software engineering tasks and more elaborately prompted, the specialized scaffolds showed no statistically significant advantage. The evaluation compared model performance on METR's existing task suite under both the general and specialized scaffolds, with minor adjustments made to the specialized agents for the comparison.
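
The comparison behind this result is simple in outline: run the same task suite under each scaffold and test whether the difference in performance is statistically significant. Below is a minimal sketch of one such test, a paired bootstrap over per-task outcomes; the toy data and the bootstrap_diff_ci helper are hypothetical illustrations, not METR's actual methodology (which estimates time horizons rather than raw success rates).

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task outcomes (1 = solved, 0 = failed) for the same
# task suite run under a general-purpose and a specialized scaffold.
general = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])      # e.g. ReAct
specialized = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1])  # e.g. Claude Code

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    # Paired bootstrap: resample tasks with replacement, using the same
    # indices for both scaffolds, and collect the difference in mean
    # success rate (b - a) across resamples.
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

mean_diff, (lo, hi) = bootstrap_diff_ci(general, specialized)
print(f"mean difference: {mean_diff:+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# A confidence interval straddling zero is the "no statistically
# significant advantage" pattern the summary describes.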

Summary written by gemini-2.5-flash-lite from 1 source.

COVERAGE [1]

  1. METR (Model Evaluation & Threat Research) TIER_1

    Measuring Time Horizon using Claude Code and Codex

    Most of METR’s time horizon measurements are done using two scaffolds: Triframe (https://github.com/METR/triframe_inspect) and ReAct (https://github.com/METR/inspect-agents/blob/main/packages/agents/src/metr_agents/agents.py) …