A new study published on arXiv evaluates four open-source PDF-to-Markdown conversion frameworks for their impact on domain-specific question-answering accuracy within Retrieval-Augmented Generation (RAG) systems. The research found that Docling, when combined with hierarchical splitting and image descriptions, achieved the highest accuracy (94.1%), outperforming even manually curated Markdown. The study highlights that data preparation quality, particularly table-dependent question handling and metadata enrichment, is more critical to RAG performance than the choice of conversion framework alone.
AI IMPACT Highlights that effective data preparation is key to RAG performance, influencing how AI systems process and utilize information.
RANK_REASON Academic paper evaluating specific technical methods for AI systems.
- DeepSeek OCR
- Docling
- José Paulo Marques Dos Santos
- LLM-as-judge
- Markdown
- PDFLoader
- Retrieval-Augmented Generation
AI-generated summary · Google Gemini · from 1 source.