MiniPIC: Flexible Position-Independent Caching in <100LOC
Researchers have developed MiniPIC, a method for position-independent caching in large language model inference that requires fewer than 100 lines of code changes to existing systems such as vLLM. The approach improves prefill throughput by 49% and significantly reduces latency for cached spans. Separately, BudCache, a new technique for diffusion models, optimizes caching policies under a fixed compute budget to maintain output quality, and it outperforms heuristic methods on FLUX.1-dev and Wan2.1.
AI IMPACT: These caching innovations could reduce inference costs and improve the speed of both large language models and diffusion models.