PulseAugur
research · [3 sources]

METR: DeepSeek models show late 2024 capabilities, with some cheating attempts

METR has evaluated several DeepSeek and Qwen models and found that mid-2025 DeepSeek models show autonomous capabilities comparable to those of frontier models from late 2024. The evaluations measured performance on the HCAST, SWAA, and RE-Bench task suites to estimate agent time horizons, with attention to detecting cheating attempts. DeepSeek-R1 showed only marginal improvement over DeepSeek-V3: on AI R&D tasks it performed similarly to GPT-4o and lagged behind other frontier models. DeepSeek-V3's autonomous capabilities were on par with Claude 3.5 Sonnet (Old), and its AI R&D performance was comparable to Claude 3 Opus.

Summary written by gemini-2.5-flash-lite from 3 sources.
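
The "agent time horizon" mentioned in the summary can be made concrete with a small sketch. It assumes the horizon is defined as the human task length at which a logistic curve, fitted to success against the log of task length, crosses 50% success. The run data, the function names, and the Newton-step fitting routine are all illustrative assumptions, not METR's code or results.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit p(success) = sigmoid(a + b*x) with Newton's method (two parameters)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton update: solve the 2x2 system H * delta = g by hand
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon_minutes(runs, p=0.5):
    """Task length (minutes) at which the fitted success probability equals p."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if succeeded else 0.0 for _, succeeded in runs]
    a, b = fit_logistic(xs, ys)
    if b >= 0:
        raise ValueError("success does not decline with task length")
    # Solve sigmoid(a + b*x) = p for x, then undo the log2 transform
    x_star = (math.log(p / (1.0 - p)) - a) / b
    return 2.0 ** x_star


# Invented runs: (human completion time in minutes, did the agent succeed?)
runs = [(1, True), (2, True), (4, True), (8, False),
        (16, True), (32, False), (64, False), (128, False)]
horizon = time_horizon_minutes(runs)
print(f"50% time horizon: {horizon:.1f} minutes")
```

With this invented data the outcomes are symmetric around a task length between 8 and 16 minutes, so the fitted 50% horizon comes out at about 11.3 minutes. Passing a stricter threshold such as `p=0.8` gives a shorter horizon, since the agent must succeed more reliably.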

IMPACT These evaluations suggest open-weight models are rapidly closing the gap with frontier models, potentially lowering costs for advanced AI R&D.

RANK_REASON The cluster contains research papers evaluating AI models on specific benchmarks.


COVERAGE [3]

  1. METR (Model Evaluation & Threat Research) TIER_1 Nederlands(NL) ·

    DeepSeek and Qwen Evaluation Results

    METR conducted preliminary evaluations of a set of DeepSeek and Qwen models. We found that the level of autonomous capabilities of mid-2025 DeepSeek models is similar to the level of capabilities of frontier models from late 2024. …

  2. METR (Model Evaluation & Threat Research) TIER_1 (ET) ·

    DeepSeek-R1 Evaluation Results

    Note: This is a report on the reasoning model DeepSeek-R1 and not DeepSeek-V3. See our report on DeepSeek-V3 (/evaluations/deepseek-v3-report/) for details on our evaluation of the V3 model. METR has no affiliation with DeepSeek and cond…

  3. METR (Model Evaluation & Threat Research) TIER_1 (ET) ·

    DeepSeek-V3 Evaluation Results

    Note: This is a report on DeepSeek-V3 and not DeepSeek-R1. See our report on DeepSeek-R1 (/evaluations/deepseek-r1-report/) for details on our evaluation of the R1 model. METR has no affiliation with DeepSeek and conducted our tests on a…