Most models are only evaluated on a fraction of the benchmarks out there.
AI2 has developed a new system called ArtifactLinker to address the problem of incomplete model evaluations. The system predicts which benchmarks a model is likely to excel on, then runs the actual evaluations to confirm whether the model achieves state-of-the-art results. The goal is a more comprehensive picture of model capabilities, obtained by testing models across a wider range of benchmarks.
IMPACT Provides a more robust method for evaluating AI models, potentially leading to more accurate model comparisons and better-informed development.
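The predict-then-verify idea summarized above can be illustrated with a minimal sketch. This is not ArtifactLinker's actual implementation, which the summary does not describe. The function name `predict_then_verify`, its parameters, the benchmark names, and all scores are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple


def predict_then_verify(
    model: str,
    benchmarks: List[str],
    predicted_score: Callable[[str, str], float],  # hypothetical predictor
    run_evaluation: Callable[[str, str], float],   # hypothetical evaluator
    current_best: Dict[str, float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return (benchmark, measured score) pairs where the model beats the current best."""
    # Step 1: rank benchmarks by the predicted margin over the current best score.
    ranked = sorted(
        benchmarks,
        key=lambda b: predicted_score(model, b) - current_best[b],
        reverse=True,
    )
    confirmed = []
    # Step 2: run the real evaluation only on the top_k most promising benchmarks.
    for bench in ranked[:top_k]:
        measured = run_evaluation(model, bench)
        # Keep the result only if the measured score actually beats the current best.
        if measured > current_best[bench]:
            confirmed.append((bench, measured))
    return confirmed


# Toy data, invented for illustration only.
best = {"bench_a": 0.80, "bench_b": 0.70, "bench_c": 0.90}
predicted = {"bench_a": 0.85, "bench_b": 0.72, "bench_c": 0.60}
measured = {"bench_a": 0.83, "bench_b": 0.69, "bench_c": 0.61}

results = predict_then_verify(
    "toy-model",
    list(best),
    lambda m, b: predicted[b],
    lambda m, b: measured[b],
    best,
    top_k=2,
)
print(results)
```

In this toy run, two benchmarks are predicted to be promising, but only one survives verification. That is the reason for confirming predictions with a real evaluation instead of reporting the predictions themselves.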