Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 19h

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Researchers have introduced IHUBERT, a Persian language model built on the RoBERTa-base encoder. It was trained on a curated 45 GB dataset derived from the Sepahr-Danesh collection, prepared with a multi-stage preprocessing pipeline that includes vector-based semantic deduplication and domain balancing. Evaluated on seven Persian natural language understanding (NLU) benchmarks, IHUBERT performed strongly, particularly in extractive question answering, where it ranked first on PQuAD and ParsiNLU-RC.
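The brief does not describe how the deduplication step is implemented, so the following is only a minimal sketch of what vector-based semantic deduplication generally looks like: each document is represented by an embedding vector, and a document is dropped when it is too similar to one already kept. The function names (`cosine`, `semantic_dedup`), the greedy strategy, the 0.95 threshold, and the toy vectors are all illustrative assumptions, not details from the IHUBERT paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Greedy near-duplicate filter (hypothetical helper, not the paper's method).

    Walks the documents in order and keeps one only if its cosine similarity
    to every previously kept document is below `threshold`.
    Returns the indices of the kept documents.
    """
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (assumed for illustration): documents 0 and 1 point in
# almost the same direction, so document 1 is treated as a near-duplicate.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept_indices = semantic_dedup(toy_embeddings, threshold=0.95)
print(kept_indices)
```

This pairwise comparison is quadratic in the number of documents, so a corpus on the scale of 45 GB would in practice need an approximate nearest-neighbor index or clustering rather than this exhaustive loop.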

IMPACT Advances Persian language modeling and reports first-place results on two extractive question answering benchmarks, PQuAD and ParsiNLU-RC.

Hugging Face
IHUBERT
Sepahr-Danesh
ParsiNLU-RC
FarsTail
ParsTwiNER
DigiMag
PERLEX