I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

New LLMs Debut With High Scores; Mistral and DeepSeek Reach 90%

By the PulseAugur editorial team · [1 source] · 2026-05-26 18:48

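A recent benchmark of ten large language models found that all five newly tested model families scored 75% or higher on coding tasks. Two models, Mistral Large 2411 and DeepSeek Chat V3-0324, reached a record 90%. The L3 Lunaris 8B model stood out by scoring 85% at a very low cost of just $0.0001 per benchmark run.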

Impact: New models keep posting high scores on coding benchmarks, which points to rapid progress in agentic capability and cost efficiency.

Ranking rationale: The article details benchmark results for multiple LLMs, including newly tested families and record scores, which places it in the research category.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

dev.to (LLM tag) · Vilius · 2026-05-26 18:48

I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

By Vilius Vystartas | May 2026 I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families — Sao10k, Anthracite, Inflection, Mancer, Undi95 — and every single one scored 75% or higher on its first try. Th…
