Researchers create Naamah, a large synthetic Sanskrit NER dataset using LLMs

By PulseAugur Editorial · [2 sources] · 2026-04-29 09:12

Researchers have developed Naamah, a synthetic dataset of over 100,000 Sanskrit sentences designed to improve Named Entity Recognition (NER) for classical Sanskrit literature. The dataset was generated by seeding a 24-billion-parameter hybrid reasoning model with entities extracted from DBpedia. Naamah aims to overcome the scarcity of annotated resources and was used to benchmark the XLM-RoBERTa and IndicBERTv2 transformer architectures.

IMPACT Provides an annotated NER dataset for classical Sanskrit, where such resources are scarce, which may enable further research and applications.

RANK_REASON Academic paper introducing a new dataset for a specific NLP task.
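
To make the seeding idea in the summary concrete, here is a minimal sketch of how seed entities could be turned into token-level NER labels for a generated sentence. It is an illustration under stated assumptions, not the paper's pipeline: the BIO tagging scheme, the `PER`/`LOC` label names, the function `bio_tag`, and the example sentence (in IAST transliteration) are all hypothetical, and exact string matching is a simplification that would miss many Sanskrit forms because of inflection and sandhi.

```python
from typing import List, Tuple

# Hypothetical helper: assigns BIO tags by exact match against seed entities.
# Nothing here is taken from the Naamah paper; it only illustrates the idea
# of labelling a sentence from a list of known (seed) entities.
def bio_tag(tokens: List[str], entities: List[Tuple[List[str], str]]) -> List[str]:
    tags = ["O"] * len(tokens)
    for ent_tokens, label in entities:
        n = len(ent_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == ent_tokens and window_free:
                tags[i] = f"B-{label}"          # first token of the entity
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"      # continuation tokens
    return tags

# Illustrative sentence: "Rama goes to Ayodhya" (assumed example, IAST).
sentence = "rāmaḥ ayodhyāṃ gacchati".split()
seeds = [(["rāmaḥ"], "PER"), (["ayodhyāṃ"], "LOC")]  # assumed label set
tags = bio_tag(sentence, seeds)
print(list(zip(sentence, tags)))
```

Token and tag sequences in this shape are the usual input for fine-tuning token-classification models such as the two benchmarked architectures, though the summary does not say which format the authors used.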

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Akhil Rajeev P, Annarao Kulkarni · 2026-04-30 04:00

Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

arXiv:2604.26456v1 · Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentat…

arXiv cs.CL TIER_1 English(EN) · Annarao Kulkarni · 2026-04-29 09:12

Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and …
