PulseAugur

New dataset boosts Persian social media text classification

Researchers have introduced PerSoMed, a new large-scale dataset for classifying Persian social media text. The dataset contains 36,000 posts across nine categories, with 4,000 samples per category so that the classes are balanced. The study benchmarks a range of models and finds that transformer-based architectures perform best, with TookaBERT-Large leading. The resource is intended to advance Persian natural language processing research.

IMPACT Provides a foundational resource for advancing Persian NLP tasks like trend analysis and user classification.

RANK_REASON The cluster contains a research paper introducing a new dataset and benchmark results.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Isun Chehreh, Ebrahim Ansari

    PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

    arXiv:2602.19333v2. Abstract: This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 po…