Someone did an audit on the new DeepSWE, the results aren't pretty

DeepSWE benchmark audit reveals execution flaws and reliability concerns

By the PulseAugur editorial team · [1 source] · 2026-06-03 19:50

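An audit of the new DeepSWE benchmark has revealed significant problems with its execution and reliability. The benchmark is meant to evaluate AI models, but it appears to have been rushed out, leading to flawed results and questionable quality assessments. The findings suggest that the benchmark needs substantial revision before it can reliably measure model performance.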

Impact: Highlights the potential unreliability of AI benchmarks, which affects model evaluation and development.

Ranking rationale: The audit of the benchmark exposes flaws in its methodology and execution.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/singularity TIER_2 English(EN) · /u/pneuny · 2026-06-03 19:50

Someone did an audit on the new DeepSWE, the results aren't pretty

While this post on the DeepSWE Benchmark GitHub is mainly focused on DeepSeek failing in many places where it shouldn't, it shows many problems with how the benchmark was conducted. It seems that the benchmark was rushed out the door and still ne…