KletterMix: Climbing Toward High-Quality German Pretraining Data
Researchers have developed KletterMix, a new high-quality German pretraining dataset for language models. This corpus was created by translating a state-of-the-art English pretraining dataset into German, carefully preserving document structure and topical diversity. Evaluations show that models trained on KletterMix achieve improved performance on German-language tasks compared to those trained on existing German corpora.
AI IMPACT: Enhances the availability of high-quality German data, potentially improving the performance of German-language AI applications.