New Metal-Sci benchmark tests LLMs on scientific compute tasks

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have developed Metal-Sci, a benchmark for evaluating how well large language models (LLMs) optimize scientific Metal compute kernels on Apple Silicon. The benchmark comprises 10 tasks spanning six optimization regimes and ships with CPU references and fitness functions. Initial tests on an Apple M1 Pro showed in-distribution speedups ranging from 1.00x (no gain) to 10.7x with models including Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5. A key methodological contribution is a held-out gate: a scoring function evaluated on unseen data that provides oversight and detects silent regressions that in-distribution scores alone would miss.
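The idea of a held-out gate can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the geometric-mean aggregation, the timing inputs, and the `min_ratio` threshold are all assumptions made for illustration. The sketch scores a candidate kernel on in-distribution cases, then rejects it if its speedup on held-out cases falls below the threshold.

```python
from math import prod


def speedup(baseline_s, candidate_s):
    """Speedup of a candidate over a baseline, both in seconds (hypothetical inputs)."""
    return baseline_s / candidate_s


def geomean(values):
    """Geometric mean; an assumed aggregation, not confirmed by the source."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def held_out_gate(in_dist, held_out, min_ratio=1.0):
    """Score a candidate on seen cases and gate it on unseen ones.

    in_dist, held_out: lists of (baseline_seconds, candidate_seconds) pairs.
    min_ratio: assumed threshold; a held-out speedup below it fails the gate.
    """
    in_score = geomean(speedup(b, c) for b, c in in_dist)
    out_score = geomean(speedup(b, c) for b, c in held_out)
    return {
        "in_dist": in_score,
        "held_out": out_score,
        "passes": out_score >= min_ratio,
    }


# A candidate that looks fast on seen cases but is slower on unseen ones.
result = held_out_gate(
    in_dist=[(2.0, 1.0), (8.0, 1.0)],
    held_out=[(1.0, 2.0), (1.0, 2.0)],
)
```

In this made-up example the candidate scores well in-distribution yet fails the gate, which is the kind of silent regression a held-out check is meant to surface.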

IMPACT This benchmark could drive LLM development for specialized scientific computing tasks on Apple hardware.

RANK_REASON The cluster contains a research paper detailing a new benchmark for evaluating LLM performance on scientific computing tasks.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Víctor Gallego · 2026-06-30 04:00

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

arXiv:2605.09708v2 Announce Type: replace-cross Abstract: We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in $n$-body problems, multi-field Boltzmann, neighbor-list molecular dynami…
