Brief · PulseAugur

TOOL · r/LocalLLaMA English(EN) · 4h

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

InfiniteKV is a new KV cache system designed to extend the context window of large language models by storing older tokens in a compressed, searchable format on disk or in RAM. This approach allows models to access information far beyond their original training limits, as demonstrated by Mistral-7B successfully answering a query from token 76,747, significantly past its 32,768 token limit. The system maintains recent tokens in GPU memory for speed while offloading older ones, drastically reducing memory requirements from gigabytes per million tokens to just a few megabytes. AI

IMPACT Enables LLMs to process and recall information from vastly extended contexts, potentially unlocking new applications in long-form content analysis and generation.