PulseAugur / Brief



Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. We created a way to offload the Qwen2.5 KV cache to RAM.

    A new technique addresses memory limits in local large language models, specifically long-context handling and keeping state across restarts. It offloads the model's KV cache, which holds the attention keys and values computed for earlier tokens, from VRAM to CPU RAM and disk. A small index kept in VRAM retrieves the relevant KV chunks on demand, letting the model work with contexts of up to 800,000 tokens while VRAM usage stays stable. Because the cache is also written to disk, the model can resume from its stored state after a process restart, so the cache effectively acts as persistent memory.

    IMPACT Enables local LLMs to handle significantly longer contexts and retain memory across sessions, potentially improving RAG performance and user experience.
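
The offload-and-retrieve flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class `OffloadedKVStore`, its methods `add_chunk` and `retrieve`, the mean-key chunk summary, and the dot-product scoring are all hypothetical choices made for illustration, since the brief does not say how the index is built or queried. Pickle files on disk stand in for the RAM and disk tiers, and a plain Python list stands in for the small index that would stay in VRAM.

```python
import os
import pickle
import tempfile


class OffloadedKVStore:
    """Toy KV store (hypothetical): chunks live on disk, only a small index stays resident."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self.index_path = os.path.join(directory, "index.pkl")
        # One summary vector per chunk. In the described system this small
        # index is the only part that would remain in VRAM.
        self.index = []
        if os.path.exists(self.index_path):
            # Reloading the index is what lets a new process resume.
            with open(self.index_path, "rb") as f:
                self.index = pickle.load(f)

    def _chunk_path(self, chunk_id):
        return os.path.join(self.directory, f"chunk_{chunk_id}.pkl")

    def add_chunk(self, keys, values):
        """Write one chunk of per-token key/value vectors to disk and index it."""
        chunk_id = len(self.index)
        with open(self._chunk_path(chunk_id), "wb") as f:
            pickle.dump({"keys": keys, "values": values}, f)
        # Assumption: summarize a chunk by the mean of its key vectors.
        dim = len(keys[0])
        summary = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
        self.index.append(summary)
        with open(self.index_path, "wb") as f:
            pickle.dump(self.index, f)
        return chunk_id

    def retrieve(self, query, top_k=1):
        """Load the top_k chunks whose summary has the largest dot product with query."""
        scores = [
            (sum(q * s for q, s in zip(query, summary)), chunk_id)
            for chunk_id, summary in enumerate(self.index)
        ]
        best = sorted(scores, reverse=True)[:top_k]
        chunks = []
        for _, chunk_id in best:
            with open(self._chunk_path(chunk_id), "rb") as f:
                chunks.append((chunk_id, pickle.load(f)))
        return chunks


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = OffloadedKVStore(tmp)
        store.add_chunk(keys=[[1.0, 0.0], [0.9, 0.1]], values=[[10.0], [11.0]])
        store.add_chunk(keys=[[0.0, 1.0], [0.1, 0.9]], values=[[20.0], [21.0]])
        # Simulate a process restart: a fresh object rebuilds its index from disk.
        restored = OffloadedKVStore(tmp)
        chunk_id, chunk = restored.retrieve([0.0, 1.0], top_k=1)[0]
        return chunk_id, chunk["values"]


if __name__ == "__main__":
    print(demo())
```

In this sketch the resident state grows with the number of chunks, not the number of tokens, which is the property that would keep VRAM usage stable as the context grows. A real system would store tensors rather than Python lists and would need a retrieval scheme that matches the model's attention layout.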