PulseAugur

New LLMs Debut with High Scores; Mistral, DeepSeek Hit 90%

A recent agent coding benchmark of ten large language models found that five previously untested model families each scored 75% or higher on their first run, and none of the ten models scored below 75%. Two models, Mistral Large 2411 and DeepSeek Chat V3-0324, reached a record-tying 90%. The L3 Lunaris 8B model stood out on value, scoring 85% at a cost of $0.0001 per benchmark run.

IMPACT Every newly tested model family scored 75% or higher on the agent coding benchmark, suggesting rapid progress in agent capabilities and cost-efficiency.

RANK_REASON The article details benchmark results for multiple LLMs, including new model families and record-tying scores, which falls under research.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to (LLM tag) · TIER_1 · English (EN) · Vilius

    I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

    By Vilius Vystartas | May 2026

    I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families (Sao10k, Anthracite, Inflection, Mancer, Undi95), and every single one scored 75% or higher on its first try. Th…