This project describes a synthetic data pipeline designed to improve instruction following in Persian large language models (LLMs). The pipeline addresses the scarcity of high-quality Persian datasets by generating structured instruction pairs with models such as GPT-4.1 mini and nano. It applies multi-stage filtering, including semantic deduplication and LLM-based quality scoring, to keep the data diverse and relevant. The curated dataset, approximately 4,000 instruction pairs across 51 domains, was then used to fine-tune the Qwen2.5 3B Instruct model via QLoRA, with training showing steady convergence.
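As an illustration of the semantic deduplication step mentioned in the summary, here is a minimal sketch, not the project's actual implementation. It assumes each instruction has already been mapped to an embedding vector by some encoder (the summary does not say which), and it uses a greedy cosine-similarity filter. The function names and the 0.9 threshold are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Greedy near-duplicate filter (hypothetical helper).

    Walks the items in order and keeps an item only if its similarity to
    every previously kept item is below `threshold` (an assumed value).
    Returns the indices of the kept items.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = semantic_dedup(toy_embeddings)
```

In this toy input the second vector points in almost the same direction as the first, so it is dropped, while the third is orthogonal and is kept. The greedy pass compares every item against all kept items, which is quadratic in the worst case. That is workable for a dataset of a few thousand pairs, but a larger corpus would likely need an approximate nearest-neighbor index instead.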
Impact: This approach could significantly improve LLM performance in low-resource languages by addressing data scarcity through synthetic generation.
Ranking rationale: The item describes a novel method for generating synthetic data to fine-tune LLMs for a specific language, detailing the pipeline and evaluation.
AI-generated summary · Google Gemini · from 1 source.