Brief · PulseAugur

COMMENTARY · r/LocalLLaMA · English (EN) · 4h

Most reliable way to do PDF to JSON?

A user on r/LocalLLaMA asks for the most reliable way to convert PDF documents to JSON, particularly documents that contain tables and occasional images. Their current pipeline extracts text with PyMuPDF and pymupdf4llm and passes it to an LLM, but the model hallucinates values and drops data for specific fields such as dates, especially when a document contains several dates. Processing is also slow, at 5-7 minutes for a 15-page document, so the user is asking for faster alternative workflows.
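The post does not describe a fix, but one way to approach the date problem can be sketched with the standard library alone: find date-like strings deterministically so the LLM only has to label them, then check that every value the LLM returns appears verbatim in the extracted text. This is a minimal sketch, not the poster's method. The function names `date_candidates` and `ungrounded_fields`, the date pattern, and the flat JSON shape are all hypothetical, and the input is assumed to be plain text already produced by an extractor such as pymupdf4llm.

```python
import json
import re

# Illustrative pattern covering a few common date shapes; it is not exhaustive.
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}[/.]\d{1,2}[/.]\d{2,4}"
    r"|\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"
)


def date_candidates(page_text, window=40):
    """Return every date-like string together with the text around it.

    The surrounding context lets a later step (an LLM or a rule) decide
    which label each date belongs to without retyping the date itself.
    """
    candidates = []
    for match in DATE_RE.finditer(page_text):
        start = max(0, match.start() - window)
        end = min(len(page_text), match.end() + window)
        candidates.append(
            {"value": match.group(0), "context": page_text[start:end].strip()}
        )
    return candidates


def ungrounded_fields(llm_json, page_text):
    """List string fields whose value does not occur verbatim in the source.

    Assumes a flat JSON object (hypothetical shape). Whitespace is collapsed
    on both sides so that line breaks in the PDF text do not cause misses.
    """
    record = json.loads(llm_json)
    normalized = " ".join(page_text.split())
    return [
        key
        for key, value in record.items()
        if isinstance(value, str) and " ".join(value.split()) not in normalized
    ]
```

Fields flagged by `ungrounded_fields` could be retried or sent for manual review. A value that the LLM legitimately reformatted (for example, converting "15 April 2024" to ISO format) would also be flagged, so this check fits best when the prompt asks the model to copy values exactly as written.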

IMPACT The post reflects demand for more reliable and faster tooling for extracting structured data from documents with LLMs.

LLM
PyMuPDF
pymupdf4llm