A recent audit of four major biomedical bibliographic APIs (PubMed E-utilities, Crossref, OpenAlex, and Semantic Scholar) revealed significant inconsistencies in how they handle Unicode characters. The study found that the PubMed AbstractText field frequently failed to preserve typographic punctuation, and that OpenAlex systematically lost special whitespace characters. While mathematical symbols and Greek letters were generally preserved, these character-level fidelity issues have direct implications for text mining, bibliometrics, and the construction of training corpora for biomedical large language models.
AI IMPACT Inconsistent data fidelity in biomedical APIs could compromise the quality and reliability of training data for specialized LLMs, potentially affecting their performance in scientific text analysis.
RANK_REASON The cluster contains a research paper detailing an audit of data fidelity in bibliographic APIs.
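The kind of character-level check the audit describes can be sketched roughly as follows. This is a minimal illustration, not the study's method: the `PROBES` inventory, the `lost_characters` helper, and the sample strings are all hypothetical, and the summary does not say which code points the authors actually tested. The sketch compares a reference string against the text an API returned and lists which probe characters went missing, grouped by class.

```python
import unicodedata

# Hypothetical probe inventory: the audit's real character set is not given
# in the summary, so these classes and code points are assumptions.
PROBES = {
    "typographic punctuation": "\u2018\u2019\u201c\u201d\u2013\u2014\u2026",
    "special whitespace": "\u00a0\u2009\u202f\u2003",
    "Greek letters": "\u03b1\u03b2\u03b3\u03bc",
    "mathematical symbols": "\u00b1\u2264\u2265\u00d7",
}


def lost_characters(reference: str, returned: str) -> dict[str, list[str]]:
    """List probe characters that occur in `reference` but not in `returned`."""
    report = {}
    for label, chars in PROBES.items():
        missing = [c for c in chars if c in reference and c not in returned]
        if missing:
            report[label] = [
                f"U+{ord(c):04X} {unicodedata.name(c, 'UNNAMED')}" for c in missing
            ]
    return report


# Made-up sample: curly quotes and no-break spaces in the reference are
# replaced by ASCII quotes and plain spaces in the returned text.
reference = "\u201cp\u00a0\u2264\u00a00.05\u201d for \u03b1-synuclein"
returned = '"p \u2264 0.05" for \u03b1-synuclein'

for label, names in lost_characters(reference, returned).items():
    print(label, names)
```

In this made-up sample the Greek letter and the inequality sign survive, while the curly quotes and no-break spaces are flagged, which mirrors the pattern the summary reports. A presence check like this only detects a character that disappears entirely; counting occurrences per code point would be needed to catch partial loss.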
- biomedical large language models
- Crossref
- Elsevier
- OpenAlex
- PubMed Central (PMC) JATS XML
- PubMed E-utilities
- Semantic Scholar
- Unicode
AI-generated summary · Google Gemini · from 1 source.