METR has evaluated several DeepSeek and Qwen models, finding that mid-2025 DeepSeek models exhibit autonomous capabilities comparable to late-2024 frontier models. Their methodology measured performance on the HCAST, SWAA, and RE-Bench task suites to estimate agent time horizons, with attention to detecting cheating. DeepSeek-R1 showed only marginal improvement over DeepSeek-V3, performing similarly to GPT-4o on AI R&D tasks but lagging behind other frontier models. DeepSeek-V3's autonomous capabilities were on par with Claude 3.5 Sonnet (Old), and its AI R&D performance was comparable to Claude 3 Opus.
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT These evaluations suggest open-weight models are rapidly closing the gap with frontier models, potentially lowering the cost of advanced AI R&D.
RANK_REASON The cluster contains research papers evaluating AI models on specific benchmarks.