Training-free sparse attention based on cumulative energy filtering
Researchers have developed LoLA, an augmentation for linear attention that improves associative recall and memory capacity in transformer models. LoLA distributes past key-value pairs across three memory systems: a local sliding window for recent pairs, a sparse global cache for difficult-to-memorize pairs, and the recurrent hidden state. This approach raises pass-key retrieval accuracy to 97.4% while using a substantially smaller cache than existing models like Llama 3.1 8B, and it also outperforms other subquadratic models on commonsense reasoning.
AI IMPACT: LoLA's approach to sparse caching and memory management could enable transformers to handle much longer contexts, potentially unlocking new applications in lifelong learning and complex reasoning.
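The three-way memory split described in the summary can be illustrated with a small sketch. This is not the authors' implementation. The class name `ThreeTierMemory`, the ELU+1 feature map, the recall-error score used to decide which pairs are "difficult to memorize", and the way exact and recurrent terms are mixed at read time are all assumptions made for illustration.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map (ELU + 1); the summary does not name one.
    return np.where(x > 0, x + 1.0, np.exp(x))

class ThreeTierMemory:
    """Hypothetical sketch: sliding window + sparse cache + recurrent state."""

    def __init__(self, dim, window=4, cache_size=2):
        self.window = window
        self.cache_size = cache_size
        self.local = []                 # most recent (k, v) pairs, kept exactly
        self.cache = []                 # hard-to-memorize (k, v) pairs, kept exactly
        self.H = np.zeros((dim, dim))   # recurrent state: sum of phi(k) v^T
        self.s = np.zeros(dim)          # normalizer: sum of phi(k)

    def _recall_error(self, k, v):
        # Assumed difficulty score: how far off would the recurrent state be
        # when asked to recall v from k, if this pair were folded into it?
        fk = feature_map(k)
        H = self.H + np.outer(fk, v)
        s = self.s + fk
        return np.linalg.norm(fk @ H / (fk @ s) - v)

    def _fold(self, k, v):
        # Standard linear-attention state update.
        fk = feature_map(k)
        self.H += np.outer(fk, v)
        self.s += fk

    def append(self, k, v):
        self.local.append((k, v))
        if len(self.local) <= self.window:
            return
        # The oldest pair leaves the window and competes for a cache slot.
        self.cache.append(self.local.pop(0))
        if len(self.cache) > self.cache_size:
            errors = [self._recall_error(ck, cv) for ck, cv in self.cache]
            easiest = int(np.argmin(errors))
            # The pair the state can memorize best is folded into it.
            self._fold(*self.cache.pop(easiest))

    def read(self, q):
        # Assumed mixing rule: recurrent term plus exp-weighted exact pairs,
        # normalized jointly. Call only after at least one append.
        fq = feature_map(q)
        num, den = fq @ self.H, fq @ self.s
        for k, v in self.local + self.cache:
            w = np.exp(q @ k / np.sqrt(len(q)))
            num = num + w * v
            den = den + w
        return num / den
```

In this sketch the exact storage is bounded by `window + cache_size` pairs no matter how long the sequence grows, which is the property that makes the cache smaller than a full key-value cache. Every other pair lives only in the fixed-size state `H`.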