Researchers have developed PDF-WuKong, a large multimodal model designed to efficiently process and answer questions about long PDF documents containing both text and images. The model uses a novel sparse-sampling technique to identify the information most relevant to a user's query, improving both efficiency and capability. To support this work, the authors created a new dataset, PaperPDF, comprising over a million question-answer pairs derived from academic papers. Experiments show PDF-WuKong outperforms existing open-source models and proprietary products by an average of 8.6% in F1 score on multimodal document understanding.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a novel approach for efficient multimodal document understanding, potentially improving research and information retrieval from academic papers.
RANK_REASON This is a research paper introducing a new model and dataset for multimodal document understanding.