PulseAugur

Researchers develop method to create quality training corpora from Wikimedia dumps

Researchers have developed a method to create high-quality training corpora for seven South Slavic languages from raw Wikimedia dumps. The process has two stages: extracting and cleaning text from several Wikimedia projects (Wikipedia, Wikisource, Wikibooks, and others), then filtering out low-quality or repetitive articles with an n-gram-based strategy. The resulting datasets are intended for training language models and for comparative linguistic research, and the approach may generalize to other languages.
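The summary does not say how the n-gram filter works, so the following is only a minimal sketch of one common way to flag repetitive articles, not the paper's method. It scores each text by the share of word n-gram occurrences that repeat and drops texts above a threshold. The function names, the trigram size, the 0.3 threshold, and the 20-token minimum are all illustrative assumptions.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return the word n-grams of text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_ratio(text, n=3):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def filter_articles(articles, n=3, max_ratio=0.3, min_tokens=20):
    """Keep articles that are long enough and not dominated by repeated n-grams.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    kept = []
    for text in articles:
        if len(text.split()) < min_tokens:
            continue  # too short to judge or to be useful
        if repeated_ngram_ratio(text, n) > max_ratio:
            continue  # mostly templated or repetitive content
        kept.append(text)
    return kept
```

A stub made of a single phrase repeated many times scores close to 1.0 and is dropped, while ordinary prose usually scores near 0. In practice the threshold would need tuning per language and per project.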

Impact: Provides a scalable method for generating specialized language corpora, potentially improving LLM performance on under-resourced languages.

Ranking rationale: Academic paper detailing a methodology for creating training data.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv cs.CL TIER_1 English(EN) · Mihailo Škorić ·

    Wiki Dumps to Training Corpora: South Slavic Case

    arXiv:2604.25384v1. Abstract: This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from ra…

  2. arXiv cs.CL TIER_1 English(EN) · Mihailo Škorić ·

    Wiki Dumps to Training Corpora: South Slavic Case

    This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wik…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Wiki Dumps to Training Corpora: South Slavic Case

    This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wik…