Is Dimensionality a Barrier for Retrieval Models?
Researchers have investigated why low-dimensional representations, typically around 1,000 dimensions, do not prevent modern embedding-based retrieval models from scaling to trillions of data points. The study focuses on maximal-margin embeddings and establishes that a near-optimal margin can be achieved with a dimension that depends on the logarithm of the data size. The findings resolve a setting from prior work concerning k-sparse rows and suggest that sigmoid loss outperforms InfoNCE for producing large-margin embeddings.
IMPACT: This research provides theoretical insights into the scalability of retrieval models, potentially influencing future model design for large-scale AI applications.
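
For readers unfamiliar with the two objectives named in the summary, the following is a minimal sketch of their commonly used textbook forms, applied to a small query-document similarity matrix. It is not taken from the study itself: the function names, the `temperature`, `scale`, and `bias` values, and the toy matrices are all illustrative assumptions. The `min_margin` helper is likewise a hypothetical way to measure the gap between each query's matching document and its best non-matching one.

```python
import math


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softplus(x):
    # Numerically stable log(1 + exp(x)); equals -log(sigmoid(-x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def info_nce_loss(sim, temperature=0.1):
    # InfoNCE: softmax cross-entropy over each row, where the
    # matching document for query i is assumed to sit at column i.
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        total += log_sum_exp(logits) - logits[i]
    return total / len(sim)


def sigmoid_loss(sim, scale=10.0, bias=-5.0):
    # Pairwise sigmoid loss: every (query, document) pair is an
    # independent binary decision with label +1 on the diagonal
    # and -1 elsewhere. scale and bias are assumed values.
    total = 0.0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            label = 1.0 if i == j else -1.0
            total += softplus(-label * (scale * s + bias))
    return total / len(sim)


def min_margin(sim):
    # Smallest gap between a query's matching score and its
    # highest non-matching score (hypothetical helper).
    return min(
        row[i] - max(s for j, s in enumerate(row) if j != i)
        for i, row in enumerate(sim)
    )


# Toy similarity matrices: a perfectly aligned one and a misaligned one.
aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
shifted = [row[-1:] + row[:-1] for row in aligned]

for name, sim in (("aligned", aligned), ("shifted", shifted)):
    print(name, info_nce_loss(sim), sigmoid_loss(sim), min_margin(sim))
```

In this sketch both losses are lower on the aligned matrix than on the shifted one, and the margin is positive only when every query scores its own document highest. The sketch does not reproduce the study's comparison between the two losses; it only shows what each objective computes.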