PulseAugur
KletterMix: Climbing Toward High-Quality German Pretraining Data

KletterMix Dataset Advances Pretraining of German Language Models

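Researchers have developed KletterMix, a new high-quality German pretraining dataset for language models. The corpus was created by translating state-of-the-art English pretraining datasets into German while carefully preserving document structure and topical diversity. Evaluations show that models trained on KletterMix perform better on German tasks than models trained on existing German corpora.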

Impact: Increases the availability of high-quality German data, which may improve the performance of German-language AI applications.

Ranking rationale: This cluster contains an academic paper on a new dataset for language model pretraining.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.CL TIER_1 English (EN) · Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting

    KletterMix: Climbing Toward High-Quality German Pretraining Data

    arXiv:2606.03773v1 Announce Type: new Abstract: High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documen…

  2. arXiv cs.CL TIER_1 English (EN) · Kristian Kersting

    KletterMix: Climbing Toward High-Quality German Pretraining Data

    High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled tra…