PulseAugur

Researchers create Naamah, a large synthetic Sanskrit NER dataset using LLMs

Researchers have developed Naamah, a synthetic dataset of over 100,000 Sanskrit sentences designed to improve Named Entity Recognition (NER) for classical Sanskrit literature. The dataset was generated by combining entity extraction from DBpedia with a 24-billion-parameter hybrid reasoning model. Naamah aims to overcome the scarcity of annotated resources and was used to benchmark the XLM-RoBERTa and IndicBERTv2 transformer architectures.

Impact: Provides a crucial dataset for advancing NLP capabilities in classical Sanskrit, potentially enabling new research and applications.

Ranking rationale: Academic paper introducing a new dataset for a specific NLP task.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.CL · TIER_1 · English (EN) · Akhil Rajeev P, Annarao Kulkarni

    Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

    arXiv:2604.26456v1. Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentat…

  2. arXiv cs.CL · TIER_1 · English (EN) · Annarao Kulkarni

    Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

    The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and …