In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation
Researchers have developed a reproducible pipeline for supervised classification of pathology reports, addressing the performance degradation that occurs when models are applied to data from a different cancer registry than the one they were trained on. The pipeline standardizes data curation and includes a manual audit to identify label noise. A model trained with this method, referred to as the Kentucky model, achieved a significantly lower false-negative rate and a higher F1 score than a baseline model trained in Seattle, indicating improved accuracy and reduced reviewer workload.
AI IMPACT: This research offers a standardized method for improving the accuracy and reliability of AI models that process sensitive medical data from different sources.
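For readers unfamiliar with the two metrics named in the summary, the sketch below shows how a false-negative rate and an F1 score are computed from confusion-matrix counts. The function names and all counts are hypothetical illustrations, not figures or code from the study.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """FNR = FN / (FN + TP): share of truly positive reports the model misses."""
    positives = tp + fn
    return fn / positives if positives else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for two models scored on the same evaluation set.
# These are made-up numbers, NOT results reported by the researchers.
in_domain = {"tp": 90, "fp": 10, "fn": 10}
out_of_domain = {"tp": 70, "fp": 20, "fn": 30}

for name, c in [("in-domain", in_domain), ("out-of-domain", out_of_domain)]:
    fnr = false_negative_rate(c["tp"], c["fn"])
    f1 = f1_score(c["tp"], c["fp"], c["fn"])
    print(f"{name}: FNR={fnr:.3f}, F1={f1:.3f}")
```

A lower false-negative rate means fewer positive reports are missed, which is one way a model can reduce the number of reports that need manual follow-up.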