Brief · PulseAugur

RESEARCH · arXiv cs.CV English(EN) · 1d · [3 sources]

Training-free sparse attention based on cumulative energy filtering

Researchers have developed LoLA, an augmentation for linear attention that improves associative recall and memory capacity in transformer models. LoLA distributes past key-value pairs across three memory systems: a local sliding window, a sparse global cache for difficult-to-memorize pairs, and the recurrent hidden state. It reaches 97.4% accuracy on pass-key retrieval with a substantially smaller cache than existing models such as Llama 3.1 8B, and it also outperforms other subquadratic models on commonsense reasoning.
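To make the three-tier layout concrete, here is a minimal pure-Python sketch of how key-value pairs might move between a sliding window, a sparse cache, and a recurrent state. It is an illustration under stated assumptions, not the authors' implementation: the class `ThreeTierMemory`, the feature map `phi`, the squared recall error used to decide which pairs count as "difficult", and the way the three tiers are combined at read time are all hypothetical choices made for this example.

```python
import math
from collections import deque


def phi(x):
    """Hypothetical positive feature map for the linear-attention branch."""
    return [max(xi, 0.0) + 1e-3 for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


class ThreeTierMemory:
    """Toy memory: exact recent window, small exact cache, lossy recurrent state."""

    def __init__(self, d_k, d_v, window_size, cache_size):
        self.window = deque()   # most recent pairs, stored exactly
        self.cache = []         # older pairs the state recalls poorly
        self.H = [[0.0] * d_k for _ in range(d_v)]  # running sum of v * phi(k)^T
        self.s = [0.0] * d_k    # running sum of phi(k), used as a normalizer
        self.window_size = window_size
        self.cache_size = cache_size
        self.folded = 0         # how many pairs were absorbed into H

    def recall_error(self, k, v):
        # Assumption: "difficult to memorize" means the state predicts v badly from k.
        fk = phi(k)
        pred = [dot(row, fk) for row in self.H]
        return sum((vi - pi) ** 2 for vi, pi in zip(v, pred))

    def _fold(self, k, v):
        # Standard linear-attention update: H += v phi(k)^T, s += phi(k).
        fk = phi(k)
        for i, vi in enumerate(v):
            for j, fj in enumerate(fk):
                self.H[i][j] += vi * fj
        self.s = [sj + fj for sj, fj in zip(self.s, fk)]
        self.folded += 1

    def write(self, k, v):
        self.window.append((k, v))
        if len(self.window) > self.window_size:
            # The oldest pair leaves the window and becomes a cache candidate.
            self.cache.append(self.window.popleft())
        if len(self.cache) > self.cache_size:
            # Keep the hard pairs exact; fold the easiest one into the state.
            errors = [self.recall_error(ck, cv) for ck, cv in self.cache]
            easiest = errors.index(min(errors))
            self._fold(*self.cache.pop(easiest))

    def read(self, q):
        # Exact softmax-style weights for window and cache, linear term for the rest.
        exact = list(self.window) + self.cache
        weights = [math.exp(dot(q, k)) for k, _ in exact]
        fq = phi(q)
        num = [dot(row, fq) for row in self.H]
        for w, (_, v) in zip(weights, exact):
            num = [n + w * vi for n, vi in zip(num, v)]
        den = sum(weights) + dot(self.s, fq)
        if den == 0.0:
            return [0.0] * len(self.H)  # nothing stored yet
        return [n / den for n in num]


# Usage: five writes with a window of 2 and a cache of 1.
mem = ThreeTierMemory(d_k=2, d_v=2, window_size=2, cache_size=1)
for t in range(5):
    mem.write([1.0 if t % 2 else 0.0, 1.0], [float(t), 1.0])
out = mem.read([1.0, 0.0])
```

In this toy run the two newest pairs stay in the window, one older pair is kept exactly in the cache, and the remaining two are compressed into the state. The point of the design is that exact storage is spent only on recent pairs and on the pairs the compressed state would otherwise recall worst.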

IMPACT LoLA's approach to sparse caching and memory management could enable transformers to handle much longer contexts, potentially unlocking new applications in lifelong learning and complex reasoning.

Hugging Face
Llama 3.1:8b
arXiv
transformer
DagsHub
linear attention
Diffusion Transformers
VBench
Flash Attention
Wan-2.2
arXivLabs
LoLA
Luke McDermott