Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
Researchers have identified a consistent bias in current text embedding models: each embedding can be decomposed into a sentence-specific component and a mean component that is nearly identical across all sentences. They propose two training-free correction methods, R1 and R2. R2, which projects embeddings off the mean direction, performed better of the two. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 consistently improved classification accuracy, and the norm of a model's mean embedding correlated with how much that model benefited.
AI IMPACT: This research offers a training-free way to correct text embeddings after the fact, which may benefit downstream NLP tasks that rely on them, such as classification.
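The idea behind R2, projecting embeddings off the mean direction, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `remove_mean_direction`, the choice to estimate the mean from the same batch of embeddings, and the final rescaling to unit length are all assumptions made for illustration, and the synthetic data only mimics a shared offset.

```python
import numpy as np


def remove_mean_direction(embeddings, mean=None):
    """Project each row off the mean direction, then rescale to unit length.

    Hypothetical helper: estimating the mean from the batch and the final
    rescaling step are assumptions, not details confirmed by the source.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    # Use a supplied mean (e.g. from a reference corpus) or the batch mean.
    mu = X.mean(axis=0) if mean is None else np.asarray(mean, dtype=np.float64)
    mu_norm = np.linalg.norm(mu)
    if mu_norm == 0.0:
        # No mean direction to remove.
        projected = X.copy()
    else:
        u = mu / mu_norm  # unit vector along the mean embedding
        # Subtract each row's component along u: e - (e . u) u
        projected = X - np.outer(X @ u, u)
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    # Guard against rows that lie entirely along the mean direction.
    return projected / np.where(norms == 0.0, 1.0, norms)


# Synthetic example: sentence-specific noise plus a large shared offset.
rng = np.random.default_rng(0)
shared_offset = np.array([5.0, 0.0, 0.0, 0.0])
raw = rng.normal(size=(100, 4)) + shared_offset
raw /= np.linalg.norm(raw, axis=1, keepdims=True)

corrected = remove_mean_direction(raw)
mean_direction = raw.mean(axis=0) / np.linalg.norm(raw.mean(axis=0))
```

After the correction, every row of `corrected` is orthogonal to `mean_direction`, so similarity scores between sentences no longer include the shared component. In practice the mean would more plausibly be estimated once on a reference corpus and passed in through the `mean` argument, which is again an assumption about usage rather than something the summary specifies.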