New Persian language model IHUBERT advances NLU benchmarks

By PulseAugur Editorial · [1 source] · 2026-06-18 11:10

Researchers have introduced IHUBERT, a monolingual Persian language model trained from scratch on the RoBERTa-base encoder architecture. The model was trained on a curated 45 GB dataset derived from the Sepahr-Danesh collection, using a multi-stage preprocessing pipeline that includes vector-based semantic deduplication and domain balancing. IHUBERT was evaluated on seven Persian Natural Language Understanding (NLU) benchmarks and performed strongly, particularly in extractive question answering, where it ranked first on PQuAD and ParsiNLU-RC.
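The summary does not describe how the paper's deduplication step works, so the following is only a generic sketch of what vector-based semantic deduplication usually means: embed each document, then drop any document whose vector is too close to one already kept. Everything here is an assumption for illustration. The `embed` function is a toy character n-gram stand-in for a real sentence encoder, the `0.9` threshold is arbitrary, and none of the names come from the IHUBERT paper.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a sentence encoder (assumption): character n-gram counts."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(docs: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep a document only if no already-kept document is too similar.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for doc in docs:
        vec = embed(doc)
        if all(cosine(vec, other) < threshold for other in kept_vectors):
            kept.append(doc)
            kept_vectors.append(vec)
    return kept


if __name__ == "__main__":
    corpus = [
        "The model was trained on Persian text.",
        "the model was  trained on Persian text.",  # near-duplicate of the first
        "Question answering is evaluated on PQuAD.",
    ]
    for doc in semantic_dedup(corpus):
        print(doc)
```

This greedy scan compares every document against all kept ones, which is quadratic in the worst case. At the scale of a 45 GB corpus, a pipeline would more plausibly use learned embeddings with an approximate nearest-neighbor index, though the summary does not say which approach the authors took.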

IMPACT Advances Persian language modeling and reports first-place results in extractive question answering on PQuAD and ParsiNLU-RC.

RANK_REASON The cluster describes a new academic paper detailing the creation and evaluation of a Persian language model.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Mohammad Reza Hasani Ahangar · 2026-06-18 11:10

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the Ro…
