Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Researchers have developed a new method called DICE (Document Inference via Chunk Evidence) to improve long-document retrieval in dense retrieval systems. This technique addresses the issue where crucial information within long documents can be diluted during encoding, leading to retrieval failures. DICE works by splitting documents into chunks, encoding them independently, and then aggregating these representations into a single vector while maintaining the standard one-query-one-document interface. The method has shown significant improvements, particularly for documents exceeding 4k tokens, by reducing the Evidence Dilution Index (EDI) compared to traditional single-vector baselines.
AI IMPACT: This method could significantly improve the performance of search and retrieval systems dealing with extensive textual data.
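To make the chunk, encode, and aggregate steps concrete, here is a minimal sketch of that pipeline. It is not the DICE implementation: the summary does not say how chunk representations are combined, so mean pooling is assumed here, and `toy_encode` is a hypothetical hashed bag-of-words stand-in for a real dense encoder. The chunk size, the dimension, and all function names are likewise illustrative assumptions.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding size for the toy encoder


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def toy_encode(text, dim=DIM):
    """Stand-in encoder (assumption): hashed bag-of-words, L2-normalized.

    A real system would call a dense retrieval model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return l2_normalize(vec)


def split_into_chunks(text, chunk_size=50):
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def aggregate_document(text, chunk_size=50):
    """Encode each chunk independently, then pool into one document vector.

    Mean pooling is an assumption; the source does not specify the aggregator.
    """
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return [0.0] * DIM
    chunk_vecs = [toy_encode(chunk) for chunk in chunks]
    mean = [sum(column) / len(chunk_vecs) for column in zip(*chunk_vecs)]
    return l2_normalize(mean)


def score(query, doc_vec):
    """One query vector against one document vector, via a dot product."""
    query_vec = toy_encode(query)
    return sum(q * d for q, d in zip(query_vec, doc_vec))
```

Because the result is a single vector per document, an index built this way can be queried exactly like a standard single-vector index: `score("some query", aggregate_document(long_text))`. Only the document-side encoding changes.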