PulseAugur

New dataset structures 23M PubMed abstracts for AI analysis

Researchers have introduced "Structured PubMed," a dataset of more than 23.2 million section-labeled biomedical abstracts from PubMed, intended to improve information retrieval and text mining. It combines abstracts that authors structured themselves with abstracts labeled automatically by a large language model (LLM) pipeline, making it a resource for training classification models and benchmarking text-segmentation architectures.

IMPACT Enables more precise information extraction and knowledge synthesis from biomedical literature.

RANK_REASON The cluster contains a research paper detailing a new dataset.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu

    A PubMed-Scale Dataset of Structured Biomedical Abstracts

    arXiv:2606.11361v1. Structured abstracts are important for biomedical literature processing, by facilitating information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, pres…

  2. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Hua Xu

    A PubMed-Scale Dataset of Structured Biomedical Abstracts

    Structured abstracts are important for biomedical literature processing, by facilitating information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, presenting a significant bottleneck for downstream tex…