The DeepSWE benchmark was runned rather incompetently and the results are completely invalid

DeepSWE benchmark run called flawed, results in question

By the PulseAugur editorial team · [1 source] · 2026-06-04 16:18

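A Reddit discussion criticizes the DeepSWE benchmark, claiming that it was run in a flawed way and that its results are therefore invalid. The criticism appears to target the methodology or implementation of the benchmark itself rather than the models being tested.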

Impact: Criticism of a benchmark's methodology affects how far AI model evaluations can be trusted.

Ranking rationale: A Reddit discussion criticizes the benchmark's methodology.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Charuru · 2026-06-04 16:18

The DeepSWE benchmark was runned rather incompetently and the results are completely invalid

https://www.reddit.com/r/LocalLLaMA/comments/1twsffj/the_deepswe_benchmark_was_runned_rather/