A recent evaluation on the DeepSWE benchmark found that the DeepSeek v4 Pro model performs poorly, passing only 8% of tasks. This result contrasts with reports from some users who find the model competitive with other leading models such as Sonnet 4.6. DeepSWE itself is presented as a new evaluation tool for software engineering tasks.
AI IMPACT New benchmarks can reveal model weaknesses, potentially guiding future development and user expectations for coding tasks.
RANK_REASON The cluster discusses a new benchmark evaluation of an existing model.
AI-generated summary · Google Gemini · from 1 source.