PulseAugur
research · [2 sources]

Researchers create Naamah, a large synthetic Sanskrit NER dataset using LLMs

Researchers have developed Naamah, a synthetic dataset of over 100,000 Sanskrit sentences designed to improve Named Entity Recognition (NER) for classical Sanskrit literature. The dataset was generated by combining entities extracted from DBpedia with a 24-billion-parameter hybrid reasoning model. Naamah is intended to address the scarcity of annotated resources, and the authors used it to benchmark the XLM-RoBERTa and IndicBERTv2 transformer architectures.

Summary written by gemini-2.5-flash-lite from 2 sources.
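
To make the idea of entity-seeded NER data concrete, here is a minimal sketch of how seed entities could be turned into token-level BIO labels for a generated sentence. It is an illustration only, not the authors' pipeline: the function name `bio_tag`, the `PER` and `LOC` labels, and the romanised example tokens are all assumptions, and exact string matching ignores Sanskrit inflection and sandhi, which a real system would have to handle.

```python
from typing import List, Tuple

def bio_tag(tokens: List[str], entities: List[Tuple[List[str], str]]) -> List[str]:
    """Label tokens with BIO tags by exact matching of seed entity token spans.

    Hypothetical helper for illustration; not taken from the Naamah paper.
    """
    tags = ["O"] * len(tokens)
    # Try longer entities first so a multi-token name wins over a shorter overlap.
    for ent_tokens, ent_type in sorted(entities, key=lambda e: -len(e[0])):
        n = len(ent_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[i:i + n])
            if window_free and tokens[i:i + n] == ent_tokens:
                tags[i] = f"B-{ent_type}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{ent_type}"
    return tags

# Made-up example: romanised tokens and assumed entity types.
tokens = ["rAmaH", "ayodhyAm", "agacchat"]
seeds = [(["rAmaH"], "PER"), (["ayodhyAm"], "LOC")]
tags = bio_tag(tokens, seeds)
print(list(zip(tokens, tags)))
```

Token and tag sequences in this shape are the usual input for fine-tuning token-classification models such as the ones benchmarked in the paper.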

IMPACT Provides a large annotated dataset for Named Entity Recognition in classical Sanskrit, where annotated resources are scarce, and could support further research and applications.

RANK_REASON Academic paper introducing a new dataset for a specific NLP task.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Akhil Rajeev P, Annarao Kulkarni ·

    Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

    arXiv:2604.26456v1. The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentat…

  2. arXiv cs.CL TIER_1 · Annarao Kulkarni ·

    Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

    The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and …