XLM-RoBERTa
PulseAugur coverage of XLM-RoBERTa: every cluster mentioning XLM-RoBERTa across labs, papers, and developer communities, ranked by signal.
-
New datasets and model advance emotional validation in AI dialogue
Researchers have introduced M-EDESConv and M-TESC, new multilingual datasets for emotional validation in dialogue systems, supporting tasks like response identification and timing detection. They also propose MEGUMI, a …
-
New SindBERT model advances Turkish NLP capabilities
Researchers have developed SindBERT, a new large-scale RoBERTa-based language model specifically for Turkish. Trained on over 300 GB of Turkish text, SindBERT is available in base and large configurations, marking the f…
-
Perplexity AI open-sources Rust tokenizer, slashing LLM inference latency
Perplexity AI has open-sourced a new Unigram tokenizer implemented in Rust, which significantly reduces latency and CPU utilization in LLM inference. The tokenizer achieves up to 5x lower p50 latency compared to …
-
Perplexity AI open-sources Unigram tokenizer for 5x speedup
Perplexity AI has open-sourced a new Unigram tokenizer designed to significantly improve CPU performance. This new tokenizer achieves a 5x reduction in latency compared to HuggingFace's implementation and a 2x reduction…
-
New dataset boosts Persian social media text classification
Researchers have introduced PerSoMed, a new large-scale dataset designed for classifying Persian social media text. The dataset contains 36,000 posts across nine categories, with each category having 4,000 samples to en…
-
AI models struggle with evolving legal language across geopolitical shifts
Researchers investigated temporal concept drift in legal judgment prediction by training transformer models on Ukrainian court decisions from different geopolitical eras. They found that models trained on older data per…
-
New NLP models tackle dementia detection in Filipino speech
Researchers have developed a new approach to dementia detection using natural language processing, focusing on low-resource languages like Filipino. They created a bilingual dataset and evaluated several transformer mod…
-
New dataset RoIt-XMASA aids Romanian and Italian sentiment analysis
Researchers have introduced RoIt-XMASA, a new dataset designed for multilingual sentiment analysis in Romanian and Italian. This dataset includes 36,000 labeled reviews across books, movies, and music, along with over 2…
-
Team DUTH explores multilingual humour retrieval challenges
Researchers from Team DUTH have explored multilingual humour-aware information retrieval using the CLEF 2025 JOKER Task 1 benchmark, which assesses humour retrieval in English and Portuguese. Their approach integrates m…
-
New CA-LIG framework enhances Transformer model explainability
Researchers have developed a new framework called Context-Aware Layer-wise Integrated Gradients (CA-LIG) to improve the explainability of Transformer models. This framework offers a unified, hierarchical approach that c…
-
New pipeline creates NLP resource for historical Greek parliamentary text
Researchers have developed a new, reproducible pipeline for creating a Universal Dependencies-style parsing resource for Katharevousa Greek parliamentary text. This workflow addresses the limitations of current NLP tool…
-
New research tackles continual learning in multilingual and multimodal LLMs
Two new research papers explore advancements in continual learning for large language models. The first paper introduces a multi-stage framework for detecting reclaimed slurs in multilingual social media, utilizing XLM-…
-
XLM-RoBERTa model improves hope speech detection in Tulu
Researchers developed an XLM-RoBERTa-based system for detecting hope speech in code-mixed Tulu social media comments. Their organically adapted model showed improved performance over a baseline on a development set. Whi…
-
New benchmark study explores neural network performance on Tajik POS tagging
This paper introduces the first benchmark for part-of-speech tagging in the Tajik language, evaluating various neural network architectures. The study utilized the TajPersParallel corpus, focusing on context-independent…
-
New Sindhi figurative language dataset SiNFluD released with XLM-RoBERTa-XL benchmark
Researchers have developed SiNFluD, a new dataset for classifying figurative language in Sindhi. The dataset was compiled from various online sources and annotated by native speakers, achieving a high inter-annotator ag…
-
Teams leverage LLMs and ensemble methods for multilingual online polarization detection at SemEval-2026
Researchers have developed systems for SemEval-2026 Task 9, a multilingual polarization detection challenge across 22 languages. One approach fine-tuned Gemma 3 models using Low-Rank Adaptation (LoRA) and augmented data…
-
Researchers create Naamah, a large synthetic Sanskrit NER dataset using LLMs
Researchers have developed Naamah, a synthetic dataset of over 100,000 Sanskrit sentences designed to improve Named Entity Recognition (NER) for classical Sanskrit literature. The dataset was generated by combining enti…
-
XITE technique boosts cross-lingual transfer for language models by up to 81%
Researchers have introduced XITE, a novel data augmentation technique designed to improve cross-lingual transfer in multilingual language models. This method leverages embedding similarities to identify and adapt labels…