Belebele
PulseAugur coverage of Belebele — every cluster mentioning Belebele across labs, papers, and developer communities, ranked by signal.
-
LangMAP tokenization improves multilingual model performance
Researchers have introduced LangMAP, a language-adaptive tokenization approach that produces language-specific tokenizations from a single shared vocabulary. This method, based on the UnigramLM algorithm, can be a…
-
Multilingual Code-Switching Boosts LLM Performance Across Four Languages
Researchers have explored the impact of multilingual code-switching data (CSD) on large language models (LLMs) across four languages: English, Japanese, Korean, and Chinese. Their experiments demonstrated that incorpora…
-
New research tackles multilingual adaptation in Mixture-of-Experts models
Two new research papers explore the adaptation of Mixture-of-Experts (MoE) models for multilingual tasks. One paper analyzes how language specialization emerges in MoE models during continual pre-training, finding that …