Researchers have developed a method for building high-quality training corpora for seven South Slavic languages from raw Wikimedia dumps. The process has two main stages: extracting and cleaning text from various Wikipedia projects, then filtering out low-quality or repetitive articles with an n-gram-based strategy. The approach aims to produce linguistically rich datasets suitable for training language models and for comparative linguistic research, and it may generalize to other languages.
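The n-gram-based filtering stage can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the n-gram size, the scoring rule, or any thresholds, so the function names `ngram_repetition_ratio` and `filter_articles`, the use of word trigrams, and the values `max_ratio=0.3` and `min_tokens=20` are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """Return the share of word n-grams that repeat an earlier n-gram.

    Hypothetical scoring rule: 0.0 means every n-gram is unique,
    values near 1.0 mean the text is almost entirely repeated phrases.
    """
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0  # too short to form a single n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def filter_articles(articles, n=3, max_ratio=0.3, min_tokens=20):
    """Keep articles that are long enough and not overly repetitive.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for article in articles:
        if len(article.split()) < min_tokens:
            continue  # drop stubs
        if ngram_repetition_ratio(article, n) > max_ratio:
            continue  # drop templated or repetitive text
        kept.append(article)
    return kept
```

In practice the thresholds would likely need tuning per language and per Wikimedia project, since templated stub articles are more common in some editions than in others.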
Impact: Provides a scalable method for generating specialized language corpora, potentially improving LLM performance on under-resourced languages.
Ranking rationale: Academic paper detailing a methodology for creating training data.
AI-generated summary · Google Gemini · from 3 sources.