Brief · PulseAugur

TOOL · dev.to — LLM tag English(EN) · 6d

I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

A recent benchmark of ten more large language models on coding tasks introduced five new model families, and none of the ten models scored below 75%. Two models, Mistral Large 2411 and DeepSeek Chat V3-0324, tied the existing record with a score of 90%. L3 Lunaris 8B stood out on value, scoring 85% at a cost of just $0.0001 per benchmark run.

IMPACT New models consistently achieve high scores on coding benchmarks, suggesting rapid progress in agent capabilities and cost-efficiency.

DeepSeek
OpenRouter
Qwen
Qwen3 8B
Inflection
Mancer
Qwen Plus 2025-07-28
Sao10k
Undi95
L3 Lunaris 8B
DeepSeek Chat V3-0324
Mistral Large 2411