Researchers have developed PDF-WuKong, a large multimodal model designed to efficiently process and answer questions about long PDF documents containing both text and images. The model uses a novel sparse-sampling technique to identify the information most relevant to a user's query, improving both efficiency and capability. To support this work, the authors created a new dataset, PaperPDF, comprising over a million question-answer pairs derived from academic papers. Experiments show PDF-WuKong outperforms existing open-source models and proprietary products by an average of 8.6% in F1 score on multimodal document understanding.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a novel approach for efficient multimodal document understanding, potentially improving research and information retrieval from academic papers.
RANK_REASON This is a research paper introducing a new model and dataset for multimodal document understanding.