PulseAugur

New Slovak Text Embedding Benchmark and Models Released

Researchers have introduced SkMTEB, a new benchmark for evaluating text embedding models on Slovak. It comprises 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for this low-resource language. The study found that large multilingual models performed best, while existing Slovak-specific NLU models did not transfer well to embedding tasks. To address this, the team developed two open-source Slovak embedding models, e5-sk-small and e5-sk-large, which perform competitively with proprietary APIs while remaining locally deployable.

IMPACT Provides a new evaluation framework and open-source models for Slovak language AI applications, potentially enabling better semantic search and RAG.

RANK_REASON The cluster describes a new academic paper introducing a benchmark and models for a specific language.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová ·

    SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

    arXiv:2606.13647v1 Announce Type: cross Abstract: We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual be…

  2. arXiv cs.AI TIER_1 English(EN) · Viktória Ondrejová ·

    SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

    We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 …