LLM Benchmark Datasets Should Be Contamination-Resistant
A new paper argues that benchmark datasets used to evaluate large language models (LLMs) must be resistant to contamination from pretraining data. The authors point out that many current benchmarks already appear in LLM training corpora, which undermines their ability to measure true generalization. They propose exploiting architectural asymmetries in Transformer models to build datasets that cannot be learned during training but remain usable at inference time, and they call on the community to adopt such contamination-resistant methods.
IMPACT If adopted, could make evaluation of LLM capabilities more reliable by preventing benchmark contamination.