Researchers develop a method for building high-quality training corpora from Wikimedia dumps

By PulseAugur Editorial · [3 sources] · 2026-04-28 08:51

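Researchers have developed a method for building high-quality training corpora for seven South Slavic languages from raw Wikimedia dumps. The process has two main phases: first, text is extracted and cleaned from various Wikimedia projects (Wikipedia, Wikisource, Wikibooks, among others); second, an n-gram-based strategy filters out low-quality or duplicate articles. The approach aims to produce linguistically rich datasets suitable for training language models and for comparative linguistic research, and it could potentially be generalized to other languages.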
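The summary does not say how the n-gram filter is defined, so the following is only a minimal sketch of one way such a second phase could look, not the paper's implementation. It assumes word trigrams, drops articles whose trigrams are highly repetitive, and drops articles whose trigram set overlaps heavily (by Jaccard similarity) with an article already kept. All function names and thresholds here (`filter_articles`, `max_repetition`, `max_overlap`) are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the list of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_ratio(text, n=3):
    """Fraction of n-grams that repeat an earlier n-gram in the same text."""
    grams = word_ngrams(text, n)
    if not grams:
        return 1.0  # too short to judge; treated as low quality (assumption)
    return 1.0 - len(set(grams)) / len(grams)


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_articles(articles, n=3, max_repetition=0.3, max_overlap=0.8):
    """Keep articles that are neither repetitive nor near-duplicates.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    kept, kept_sets = [], []
    for text in articles:
        # Low-quality heuristic: too many repeated n-grams.
        if repetition_ratio(text, n) > max_repetition:
            continue
        grams = set(word_ngrams(text, n))
        # Duplicate heuristic: high n-gram overlap with a kept article.
        if any(jaccard(grams, seen) > max_overlap for seen in kept_sets):
            continue
        kept.append(text)
        kept_sets.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of kept articles, which is fine for illustration; a full-size dump would more likely need hashing or an index over n-grams.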

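Impact: Offers a scalable method for building specialized language corpora, which could improve the performance of large language models on low-resource languages.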

Ranking rationale: An academic paper detailing a method for creating training data.


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv cs.CL TIER_1 English(EN) · Mihailo Škorić · 2026-04-29 04:00

Wiki Dumps to Training Corpora: South Slavic Case

arXiv:2604.25384v1. Abstract: This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from ra…
arXiv cs.CL TIER_1 English(EN) · Mihailo Škorić · 2026-04-28 08:51

Wiki Dumps to Training Corpora: South Slavic Case

This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wik…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-28 08:51

Wiki Dumps to Training Corpora: South Slavic Case

This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wik…
