PulseAugur

Biomedical APIs show inconsistent Unicode fidelity, impacting LLM training data

A recent audit of four major biomedical bibliographic APIs (PubMed E-utilities, Crossref, OpenAlex, and Semantic Scholar) found that they handle Unicode characters inconsistently. The PubMed AbstractText field frequently failed to preserve typographic punctuation, and OpenAlex systematically lost special whitespace characters. Mathematical symbols and Greek letters were generally preserved. These character-level fidelity losses directly affect text mining, bibliometrics, and the construction of training corpora for biomedical large language models.
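To make the idea of a character-level fidelity check concrete, here is a minimal sketch of how lost characters could be counted by class. It is not the audit's actual method: the function names, the class buckets, and the sample strings are all illustrative assumptions, and the check only compares counts of non-ASCII characters between a reference string and a returned string.

```python
import unicodedata
from collections import Counter


def char_class(ch: str) -> str:
    """Bucket a non-ASCII character into a coarse class (illustrative buckets only)."""
    cat = unicodedata.category(ch)
    name = unicodedata.name(ch, "")
    if cat in ("Pd", "Pi", "Pf"):      # dashes and curly quotes
        return "typographic punctuation"
    if cat == "Zs":                    # no-break space, thin space, etc.
        return "special whitespace"
    if name.startswith("GREEK"):
        return "greek"
    if cat == "Sm":                    # math symbols such as U+2264
        return "math symbol"
    return "other"


def lost_characters(reference: str, returned: str) -> Counter:
    """Count non-ASCII characters present in `reference` but missing from `returned`."""
    ref_counts = Counter(ch for ch in reference if ord(ch) > 127)
    got_counts = Counter(ch for ch in returned if ord(ch) > 127)
    missing = ref_counts - got_counts  # Counter subtraction keeps positive counts only
    by_class = Counter()
    for ch, n in missing.items():
        by_class[char_class(ch)] += n
    return by_class


# Hypothetical sample: a published snippet and an ASCII-folded API response.
reference = "\u201cTNF\u2013\u03b1\u201d \u2264\u00a05\u2009mg"
returned = '"TNF-\u03b1" \u2264 5 mg'

report = lost_characters(reference, returned)
print(dict(report))
```

In this made-up sample the curly quotes, en dash, no-break space, and thin space are lost while the Greek letter and the math symbol survive, which mirrors the pattern the summary describes. A real audit would also need to align text positionally rather than compare counts alone.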

IMPACT Inconsistent data fidelity in biomedical APIs could compromise the quality and reliability of training data for specialized LLMs, potentially affecting their performance in scientific text analysis.

RANK_REASON The cluster contains a research paper detailing an audit of data fidelity in bibliographic APIs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Przemysław Czuma

    Invisible to humans, visible to machines: a preregistered audit of Unicode fidelity across four biomedical bibliographic APIs

    arXiv:2606.24897v1 Abstract: Biomedical text mining, scientometrics, and the construction of training corpora for biomedical large language models (LLMs) all assume that the abstract text returned by a bibliographic API faithfully reproduces the published abs…