PulseAugur / Brief

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

    Researchers have introduced IHUBERT, a Persian language model built on the RoBERTa-base encoder. It was trained on a curated 45 GB dataset derived from the Sepahr-Danesh collection, using a multi-stage preprocessing pipeline that includes vector-based semantic deduplication and domain balancing. Evaluated on seven Persian natural language understanding (NLU) benchmarks, IHUBERT performed strongly, particularly in extractive question answering, where it ranked first on PQuAD and ParsiNLU-RC.
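
    For readers unfamiliar with the idea, vector-based semantic deduplication can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the IHUBERT authors' pipeline: the `embed` function is a hypothetical stand-in (character trigram counts) for a learned sentence embedding, the 0.9 cosine threshold is an arbitrary choice, and the greedy pairwise comparison would need an approximate nearest-neighbor index to scale to a 45 GB corpus.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Hypothetical stand-in for a learned embedding: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_dedup(docs, threshold=0.9):
    """Greedily keep a document only if it is not too similar to any kept one."""
    kept, kept_vecs = [], []
    for doc in docs:
        vec = embed(doc)
        # Assumed rule: similarity at or above the threshold marks a near-duplicate.
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept
```

    Calling `semantic_dedup` on a list of strings returns the subset that survives the filter, in original order. A real pipeline would likely swap `embed` for a neural sentence encoder, which is what lets paraphrases (not just near-identical strings) be detected.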

    IMPACT: Advances Persian language modeling and reports first-place results on two extractive question answering benchmarks, PQuAD and ParsiNLU-RC.