IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Researchers have introduced IHUBERT, a Persian language model built on the RoBERTa-base encoder. The model was trained on a curated 45 GB dataset derived from the Sepahr-Danesh collection, using a multi-stage preprocessing pipeline that includes semantic deduplication and domain balancing. IHUBERT was evaluated on seven Persian Natural Language Understanding (NLU) benchmarks and performed strongly, particularly in extractive question answering, where it ranked first on PQuAD and ParsiNLU-RC.
AI IMPACT: Advances Persian language modeling and sets new top results on specific NLU tasks.
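
For readers unfamiliar with vector-based semantic deduplication, the general idea can be sketched as follows. This is a minimal illustration, not the IHUBERT pipeline: the summary does not describe the embedding model, similarity measure, or threshold used, so the cosine-similarity rule, the 0.95 threshold, the function names, and the toy 2-dimensional vectors are all assumptions chosen for clarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy near-duplicate removal (hypothetical helper).

    Keeps a document only if its embedding is below `threshold`
    similarity to every document already kept. Returns kept indices.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (assumed): documents 0 and 1 point in nearly the
# same direction, document 2 is unrelated.
docs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = semantic_dedup(docs, threshold=0.95)
print(kept_indices)
```

This greedy pass compares each document against all kept ones, which is quadratic in the worst case; at the scale of a 45 GB corpus, an approximate nearest-neighbor index or clustering step would normally replace the exhaustive comparison.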