Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval
A new paper published on arXiv identifies pooling operations and semantic shift, rather than text length or attention mechanisms alone, as the primary drivers of embedding collapse in long texts. The research establishes a theoretical framework showing how contextual pooling inherently causes semantic dilution and spatial concentration of embedding vectors. Experiments show that semantic shift is the main predictor of embedding concentration and that anisotropy is detrimental only when it is caused by significant semantic shift, offering a new explanation for the challenges of long-context retrieval.
IMPACT: Provides a theoretical framework and experimental evidence to address fundamental challenges in long text embedding, potentially improving retrieval systems.
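
The dilution effect described in the summary can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's experimental setup: token vectors are drawn from a Gaussian toy model, every token carries a hypothetical shared offset (a stand-in for anisotropy), each document has a fixed length of 64 tokens, and "semantic shift" is modeled simply as the number of distinct topics inside one document. The function names (`mean_pooled_docs`, `mean_pairwise_cosine`, `concentration_by_shift`) are made up for illustration.

```python
import numpy as np


def mean_pooled_docs(num_docs, num_topics, total_tokens, dim, rng, shared):
    """Build toy documents of fixed length and mean-pool their token vectors.

    Assumption: each topic is a random Gaussian direction, and every token
    is its topic vector plus a shared offset plus small noise.
    """
    tokens_per_topic = total_tokens // num_topics
    docs = np.empty((num_docs, dim))
    for i in range(num_docs):
        topics = rng.normal(size=(num_topics, dim))
        # Repeat each topic vector so the document has the same length
        # regardless of how many topics it covers.
        tokens = np.repeat(topics, tokens_per_topic, axis=0)
        tokens = tokens + 0.1 * rng.normal(size=tokens.shape) + shared
        docs[i] = tokens.mean(axis=0)  # mean pooling
    return docs


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Subtract the n diagonal entries (each equal to 1) before averaging.
    return (sims.sum() - n) / (n * (n - 1))


def concentration_by_shift(topic_counts=(1, 4, 16), seed=0):
    """Map each topic count to the mean pairwise cosine of pooled documents."""
    rng = np.random.default_rng(seed)
    dim = 64
    shared = 0.5 * rng.normal(size=dim)  # hypothetical common component
    return {
        k: mean_pairwise_cosine(mean_pooled_docs(200, k, 64, dim, rng, shared))
        for k in topic_counts
    }


if __name__ == "__main__":
    for k, cos in concentration_by_shift().items():
        print(f"topics per document: {k:2d}  mean pairwise cosine: {cos:.3f}")
```

In this toy model the document length stays constant while the number of topics grows, so any rise in average pairwise cosine similarity comes from averaging over more unrelated topics rather than from length. The topic-specific part of each pooled vector shrinks as more topics are averaged together, leaving the shared offset to dominate, which is one simple way to picture semantic dilution and vector concentration.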