PulseAugur
I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

New LLMs debut with high scores; Mistral and DeepSeek reach 90%

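A recent benchmark of ten large language models found that all five newly tested model families scored 75% or higher on coding tasks. Two models, Mistral Large 2411 and DeepSeek Chat V3-0324, reached a record score of 90%. L3 Lunaris 8B stood out by scoring 85% at very low cost, only $0.0001 per benchmark run.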

Impact: New models keep posting high scores on coding benchmarks, which points to rapid progress in agentic capability and cost efficiency.

Ranking rationale: The article details benchmark results for multiple LLMs, including newly tested families and record scores, which places it in the research category.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

  1. dev.to (LLM tag) · Vilius

    I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

    By Vilius Vystartas | May 2026

    I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families (Sao10k, Anthracite, Inflection, Mancer, Undi95), and every single one scored 75% or higher on its first try. Th…