Brief · PulseAugur

TOOL · dev.to — LLM tag English(EN) · 5h

We built a way to offload the Qwen2.5 KV cache to RAM.

A technique for local large language models addresses two memory limitations: handling long contexts and keeping state across restarts. It offloads the model's KV cache, which stores the attention keys and values already computed for earlier tokens, from VRAM to CPU RAM and disk. A small index kept in VRAM is used to retrieve the relevant KV chunks when they are needed, allowing models to access contexts of up to 800,000 tokens while VRAM usage stays stable. The system also lets a model resume from its stored state after a process restart, so the offloaded cache effectively acts as persistent memory.
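The offload-and-retrieve idea can be illustrated with a toy sketch. This is not the authors' implementation: the class name `OffloadedKVStore`, the mean-of-keys summary used as the index, the dot-product scoring, and the JSON file format are all assumptions made for illustration. Plain Python lists stand in for GPU tensors, a dict stands in for CPU RAM, and the small `index` dict stands in for the part that would stay in VRAM.

```python
import json
import os
import tempfile


class OffloadedKVStore:
    """Toy KV-chunk store (hypothetical, for illustration only).

    Full chunks live in `chunks` (standing in for CPU RAM) and on disk.
    Only one small summary vector per chunk is kept in `index`
    (standing in for the small index held in VRAM).
    """

    def __init__(self, path):
        self.path = path
        self.chunks = {}  # chunk_id -> {"keys": [...], "values": [...]}
        self.index = {}   # chunk_id -> summary vector (mean of the keys)
        if os.path.exists(path):
            self._load()  # resume from the state saved by an earlier process

    def add_chunk(self, keys, values):
        """Offload one chunk and record its summary in the index."""
        chunk_id = len(self.chunks)
        self.chunks[chunk_id] = {"keys": keys, "values": values}
        dim = len(keys[0])
        self.index[chunk_id] = [
            sum(k[d] for k in keys) / len(keys) for d in range(dim)
        ]
        return chunk_id

    def retrieve(self, query, top_k=1):
        """Return the top_k chunks whose summary best matches the query."""
        def score(chunk_id):
            return sum(q * s for q, s in zip(query, self.index[chunk_id]))

        ranked = sorted(self.index, key=score, reverse=True)
        return [(cid, self.chunks[cid]) for cid in ranked[:top_k]]

    def save(self):
        """Persist chunks and index so a restarted process can reload them."""
        with open(self.path, "w") as f:
            json.dump({"chunks": self.chunks, "index": self.index}, f)

    def _load(self):
        with open(self.path) as f:
            data = json.load(f)
        # JSON turns integer dict keys into strings, so convert them back.
        self.chunks = {int(k): v for k, v in data["chunks"].items()}
        self.index = {int(k): v for k, v in data["index"].items()}


# Usage: offload two chunks, save, then "restart" by building a new store.
path = os.path.join(tempfile.mkdtemp(), "kv_store.json")
store = OffloadedKVStore(path)
first = store.add_chunk([[1.0, 0.0], [1.0, 0.0]], [[0.1, 0.2], [0.3, 0.4]])
second = store.add_chunk([[0.0, 1.0], [0.0, 1.0]], [[0.5, 0.6], [0.7, 0.8]])
store.save()

restored = OffloadedKVStore(path)
best_id, best_chunk = restored.retrieve([0.0, 1.0], top_k=1)[0]
```

In this sketch the memory that must stay resident grows with the number of chunks rather than the number of tokens, which is the property the summary attributes to the VRAM index. A real system would move tensors between devices and would likely use a more selective index than a per-chunk mean.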

IMPACT Lets local LLMs handle much longer contexts and retain memory across sessions, which could improve RAG performance and user experience.

LLM
Mamba
Qwen2.5
RTX 5060
Qwen2.5-7B-1M
MiniCPM-1B