Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers · English (EN) · 1w · [2 sources]

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Researchers have developed QCFuse, a method for making Retrieval-Augmented Generation (RAG) serving more efficient. It targets the high cost of processing retrieved contexts in LLMs by reusing precomputed KV caches instead of recomputing them for every request. QCFuse uses a compressed-view, query-aware selector: it conditions user-query states on compact per-chunk anchors and identifies which tokens need recomputation without inspecting every layer, while reportedly matching the quality of a full prefill.
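To make the idea of anchor-based selection concrete, here is a minimal toy sketch in Python. It is not the QCFuse algorithm: the mean-pooled anchor, the dot-product scoring, and every name in it (`CachedChunk`, `build_chunk`, `select_recompute_tokens`, `top_chunks`, `tokens_per_chunk`) are assumptions made for illustration. It only shows the general shape of the approach: rank cached chunks against the query using one compact vector per chunk, then pick a small set of tokens to recompute within the chunks that matter.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean, used here as a hypothetical chunk anchor."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


@dataclass
class CachedChunk:
    chunk_id: str
    token_keys: List[Vector]  # stand-in for a chunk's precomputed KV entries
    anchor: Vector            # compact per-chunk summary (assumed: mean of keys)


def build_chunk(chunk_id: str, token_keys: List[Vector]) -> CachedChunk:
    """Precompute the anchor once, when the chunk is first cached."""
    return CachedChunk(chunk_id, token_keys, mean_vector(token_keys))


def select_recompute_tokens(
    query: Vector,
    chunks: List[CachedChunk],
    top_chunks: int,
    tokens_per_chunk: int,
) -> Dict[str, List[int]]:
    """Return, per selected chunk, the token positions to recompute."""
    # Stage 1: rank chunks by comparing the query to anchors only,
    # so the full per-token cache is not scanned for every chunk.
    ranked = sorted(chunks, key=lambda c: dot(query, c.anchor), reverse=True)

    selected: Dict[str, List[int]] = {}
    for chunk in ranked[:top_chunks]:
        # Stage 2: inside a relevant chunk, keep the tokens whose keys
        # align most strongly with the query.
        order = sorted(
            range(len(chunk.token_keys)),
            key=lambda i: dot(query, chunk.token_keys[i]),
            reverse=True,
        )
        selected[chunk.chunk_id] = sorted(order[:tokens_per_chunk])
    return selected


# Toy usage with 2-dimensional vectors (illustrative values only).
chunks = [
    build_chunk("a", [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
    build_chunk("b", [[0.0, 1.0], [0.0, 0.8]]),
]
picked = select_recompute_tokens([1.0, 0.0], chunks, top_chunks=1, tokens_per_chunk=2)
print(picked)
```

In this sketch, tokens that are not selected would keep their cached entries unchanged, which is where the savings over a full prefill would come from. A real system would operate on model hidden states and attention layers rather than hand-written vectors.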

IMPACT QCFuse could speed up RAG serving, which may reduce inference costs and increase throughput for LLM applications.

Retrieval-Augmented Generation
SGLang
Large language model (LLM)
QCFuse
ProphetKV
Large Language Models