I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.
A recent benchmark of ten large language models found that five new model families debuted with scores of 75% or higher on coding tasks. Two models, Mistral Large 2411 and DeepSeek Chat V3-0324, achieved a record-tying score of 90%. The L3 Lunaris 8B model stood out for value, scoring 85% at a cost of just $0.0001 per benchmark run.
AI IMPACT: New models consistently achieve high scores on coding benchmarks, indicating rapid progress in agent capabilities and cost-efficiency.