Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining
A new research paper examines how the composition of synthetic data affects the pretraining of time series foundation models. The study found that the choice of synthetic data generator can produce a twofold difference in forecasting error, and that generator rankings are not consistent across model architectures. The researchers report that mixing multiple generators with real data yields the strongest pretraining corpora, and they frame the problem as one of corpus composition rather than generator selection.
AI IMPACT: Highlights the importance of synthetic data composition for time series models, which could improve forecasting accuracy and inform model development.