This project describes a synthetic data pipeline designed to improve instruction-following in Persian Large Language Models (LLMs). The pipeline addresses the scarcity of high-quality Persian datasets by generating structured instruction pairs with models such as GPT-4.1 mini and GPT-4.1 nano. It applies multi-stage filtering, including semantic deduplication and LLM-based quality scoring, to keep the data diverse and relevant. The curated dataset, comprising approximately 4,000 instruction pairs across 51 domains, was then used to fine-tune the Qwen2.5 3B Instruct model via QLoRA, with training reported to converge steadily.
AI IMPACT This approach could significantly improve LLM performance in low-resource languages by addressing data scarcity through synthetic generation.
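The semantic deduplication stage mentioned in the summary can be illustrated with a minimal sketch. The summary does not say which embedding model or similarity threshold the pipeline uses, so everything here is an assumption: `embed` is a hypothetical stand-in based on character trigram counts (a real pipeline would more likely use a multilingual sentence-embedding model), and the `0.9` threshold and the function names are illustrative only.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Hypothetical stand-in embedding: character n-gram counts.

    Assumption: the actual pipeline would use a learned sentence embedding.
    """
    normalized = " ".join(text.split()).lower()
    count = max(len(normalized) - n + 1, 1)
    return Counter(normalized[i:i + n] for i in range(count))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(instructions: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy filter: keep an instruction only if it is not too similar
    to any instruction already kept. The threshold is an assumed value."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in instructions:
        vector = embed(text)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    samples = [
        "Translate this sentence to Persian.",
        "Translate  this sentence to Persian.",  # near-duplicate (extra space)
        "Write a poem about autumn.",
    ]
    for line in deduplicate(samples):
        print(line)
```

The greedy pass compares each candidate against every kept item, which is quadratic in the worst case; for a dataset of a few thousand pairs that is likely acceptable, while larger corpora would probably call for an approximate nearest-neighbor index.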
RANK_REASON The item describes a novel method for generating synthetic data to fine-tune LLMs for a specific language, detailing the pipeline and evaluation.
AI-generated summary · Google Gemini · from 1 source.