DeepSeek v4 Pro struggles on new DeepSWE coding benchmark

By PulseAugur Editorial · [1 source] · 2026-05-31 11:09

A recent evaluation on the DeepSWE benchmark found that the DeepSeek v4 Pro model passed only 8% of tasks. The result contrasts with reports from some users who find the model competitive with other leading models such as Sonnet 4.6. DeepSWE itself is presented as a new evaluation tool for software engineering tasks.

IMPACT New benchmarks can reveal model weaknesses, potentially guiding future development and user expectations for coding tasks.

RANK_REASON The cluster discusses a new benchmark evaluation of an existing model.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Federal_Spend2412 · 2026-05-31 11:09

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

https://www.reddit.com/r/LocalLLaMA/comments/1tsse9i/deepswe_benchmarks_indicate_that_deepseek_v4_pro/
