MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation
Researchers have developed MVR-cache, a semantic caching system designed to reduce the cost and latency of serving Large Language Models (LLMs). The system combines Multi-Vector Retrieval (MVR) with a learnable prompt segmentation model to identify matching prompts more accurately. By splitting prompts into segments and employing a reinforcement learning strategy, MVR-cache increased cache hit rates by up to 37% compared to existing state-of-the-art methods, while maintaining strict correctness guarantees.
AI IMPACT: MVR-cache's higher cache hit rates could lower operational costs and shorten response times for LLM-powered applications.
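
To make the idea of multi-vector matching concrete, here is a minimal sketch of a semantic cache that stores one vector per prompt segment instead of a single vector per prompt. It is not MVR-cache's implementation: the punctuation-based `segment` function stands in for the learned segmentation model, `embed` is a toy bag-of-words encoder, the symmetric MaxSim score and the `0.8` threshold are assumptions chosen for illustration, and the reinforcement learning strategy and correctness guarantees are not modeled. All names in the block are hypothetical.

```python
import math
import re
from collections import Counter


def segment(prompt):
    """Naive stand-in for a learned segmenter: split on sentence punctuation."""
    parts = re.split(r"[.?!;]\s*", prompt)
    return [p.strip() for p in parts if p.strip()]


def embed(text):
    """Toy L2-normalized bag-of-words vector (a real cache would use a neural encoder)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def maxsim(src, dst):
    """Average, over src segments, of the best cosine match among dst segments."""
    if not src or not dst:
        return 0.0
    return sum(max(cosine(s, d) for d in dst) for s in src) / len(src)


def score(query_vecs, cached_vecs):
    """Symmetric score: every segment on both sides must find a good match."""
    return min(maxsim(query_vecs, cached_vecs), maxsim(cached_vecs, query_vecs))


class MultiVectorCache:
    def __init__(self, threshold=0.8):  # threshold is an illustrative assumption
        self.threshold = threshold
        self.entries = []  # list of (segment vectors, cached response)

    def put(self, prompt, response):
        self.entries.append(([embed(s) for s in segment(prompt)], response))

    def get(self, prompt):
        """Return the best-matching cached response, or None on a cache miss."""
        query_vecs = [embed(s) for s in segment(prompt)]
        best_score, best_response = 0.0, None
        for cached_vecs, response in self.entries:
            current = score(query_vecs, cached_vecs)
            if current > best_score:
                best_score, best_response = current, response
        return best_response if best_score >= self.threshold else None


cache = MultiVectorCache()
cache.put("Translate to French. The cat sat on the mat.", "Le chat était assis sur le tapis.")
hit = cache.get("Translate to French! The cat sat on the mat")
partial = cache.get("Translate to French.")
miss = cache.get("What is the capital of Peru?")
```

In this sketch, the reworded prompt is a hit because each of its segments matches a cached segment, while the instruction-only prompt and the unrelated prompt fall below the threshold and are treated as misses. Scoring in both directions is one simple way to avoid returning a cached answer when only part of a prompt matches.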