QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
Researchers have developed QCFuse, a method for making Retrieval-Augmented Generation (RAG) serving more efficient. It addresses the high cost of processing retrieved contexts in large language models (LLMs) by reusing precomputed key-value (KV) caches. QCFuse employs a compressed-view, query-aware selector that conditions user-query states on compact per-chunk anchors and identifies the tokens to recompute without requiring full-layer inspection, achieving quality at the level of a full prefill.
AI IMPACT: QCFuse aims to speed up RAG serving, potentially reducing inference costs and increasing throughput for LLM applications.
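The selection idea in the summary can be illustrated with a toy sketch. This is not the QCFuse algorithm: the summary gives no formulas, so the mean-pooled anchor, the additive conditioning step, the dot-product score, and the names `mean_anchor`, `condition_query`, and `select_recompute_tokens` are all assumptions made for illustration. It only shows the general shape of the pipeline: compress each cached chunk into an anchor, condition the query on those anchors, then pick a small budget of tokens to recompute.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_anchor(chunk_keys: List[Vector]) -> Vector:
    # Assumption: a chunk's "anchor" is the mean of its cached key vectors.
    count, dim = len(chunk_keys), len(chunk_keys[0])
    return [sum(key[d] for key in chunk_keys) / count for d in range(dim)]


def condition_query(query: Vector, anchors: List[Vector]) -> Vector:
    # Assumption: conditioning adds an attention-weighted mix of the anchors
    # to the query, so the query "sees" a compressed view of every chunk.
    weights = softmax([dot(query, a) for a in anchors])
    context = [
        sum(w * a[d] for w, a in zip(weights, anchors))
        for d in range(len(query))
    ]
    return [q + c for q, c in zip(query, context)]


def select_recompute_tokens(
    query: Vector, chunks: List[List[Vector]], budget: int
) -> List[Tuple[int, int]]:
    """Return up to `budget` (chunk_index, token_index) pairs to recompute."""
    anchors = [mean_anchor(chunk) for chunk in chunks]
    conditioned = condition_query(query, anchors)
    # Assumption: a token's importance is its dot product with the
    # conditioned query; the highest-scoring tokens are recomputed.
    scored = [
        (dot(conditioned, key), chunk_idx, token_idx)
        for chunk_idx, chunk in enumerate(chunks)
        for token_idx, key in enumerate(chunk)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted((c, t) for _, c, t in scored[:budget])


if __name__ == "__main__":
    cached_chunks = [
        [[1.0, 0.0], [0.0, 1.0]],   # chunk 0: two cached token keys
        [[0.9, 0.1], [-1.0, 0.0]],  # chunk 1: two cached token keys
    ]
    print(select_recompute_tokens([1.0, 0.0], cached_chunks, budget=2))
    # prints [(0, 0), (1, 0)]
```

In this sketch every other token would keep its precomputed KV entries, so the recomputation cost scales with `budget` rather than with the total retrieved context length.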