A 9-point eval gain vanished when we deduped train against test
A machine learning team at Nexus Labs discovered that a large performance increase in their fine-tuned Qwen3-8B model was due to data contamination. The model reached 80.4% accuracy on a ticket-routing task, up from the base model's 71.2%, but the gain was illusory. Using MinHash LSH to detect near-duplicate entries between the training and evaluation datasets, they found that about 6% of the evaluation data had been inadvertently included in the training set. After removing these contaminated samples, the model's true accuracy was closer to 72%, indicating minimal actual improvement from fine-tuning.
IMPACT: Highlights the critical need for rigorous data validation in ML pipelines to prevent inflated performance metrics and ensure genuine model generalization.
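The summary names MinHash LSH but gives no implementation details, so the following is only a minimal sketch of how such a train-versus-eval check might look, not the team's actual pipeline. It uses the Python standard library only. The shingle size, signature length, band count, similarity threshold, and the function name `contaminated_eval_indices` are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

# All parameters below are assumed values for illustration.
NUM_PERM = 64             # number of hash functions in each signature
BANDS = 16                # number of LSH bands
ROWS = NUM_PERM // BANDS  # signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash, standing in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def jaccard(a, b):
    """Exact Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def _band_keys(sig):
    """Split a signature into (band index, band slice) bucket keys."""
    return [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


def contaminated_eval_indices(train, eval_set, threshold=0.8):
    """Return indices of eval items that have a near-duplicate in train."""
    # Index every training text under each of its band keys.
    buckets = defaultdict(list)
    for i, text in enumerate(train):
        for key in _band_keys(minhash(text)):
            buckets[key].append(i)

    flagged = set()
    for j, text in enumerate(eval_set):
        # Any training text sharing at least one band is a candidate.
        candidates = set()
        for key in _band_keys(minhash(text)):
            candidates.update(buckets.get(key, []))
        # Confirm candidates with exact Jaccard to discard false positives.
        if any(jaccard(text, train[i]) >= threshold for i in candidates):
            flagged.add(j)
    return flagged
```

Calling `contaminated_eval_indices(train_texts, eval_texts)` returns the set of evaluation indices to drop before rescoring. Banding only proposes candidates, so a near-duplicate just above the threshold can occasionally be missed; using more bands with fewer rows each raises recall at the cost of more exact comparisons.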