Researchers have introduced Erase-then-Delta Attention (EDA), a memory update rule for recurrent memory models. Previous methods anchor corrections to the write address. EDA instead decouples the erase and write operations, so outdated information can be actively suppressed at a separate address before new content is written. In language model pretraining experiments, this added flexibility in memory management proved effective for both dense and Mixture-of-Experts (MoE) architectures. EDA also performs better in long-context evaluations, and its advantage persists after extensive midtraining.
AI IMPACT This new attention mechanism could improve the efficiency and long-context capabilities of future language models.
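The summary does not give the EDA equations, so the following is only a minimal NumPy sketch of the idea of decoupling erase from write, assuming a linear-attention-style matrix memory `S` that maps key addresses to values. The function names, the erase address `e`, and the gates `alpha` and `beta` are hypothetical choices for illustration, not the paper's formulation. `delta_update` is the standard delta rule, which corrects only what is stored at the write address; `erase_then_delta_update` first suppresses content at a separate address and then applies the same write.

```python
import numpy as np


def delta_update(S, k, v, beta):
    """Standard delta rule: move the value read at address k toward v."""
    k = k / np.linalg.norm(k)          # unit-norm write address
    pred = S @ k                       # value the memory currently returns at k
    return S + beta * np.outer(v - pred, k)


def erase_then_delta_update(S, e, k, v, alpha, beta):
    """Hypothetical erase-then-write step (an assumption, not the paper's rule).

    1. Suppress whatever is stored along a separate erase address e.
    2. Apply an ordinary delta-rule write at address k.
    """
    e = e / np.linalg.norm(e)          # unit-norm erase address
    S = S - alpha * np.outer(S @ e, e) # remove the component stored at e
    return delta_update(S, k, v, beta)


# Toy memory: 2-dim values, 3-dim keys, one stale value stored at address a1.
a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([0.0, 1.0, 0.0])
v_old = np.array([5.0, -1.0])
v_new = np.array([2.0, 3.0])
S0 = np.outer(v_old, a1)

S_delta = delta_update(S0, a2, v_new, beta=1.0)
S_eda = erase_then_delta_update(S0, a1, a2, v_new, alpha=1.0, beta=1.0)
```

In this toy setup, the plain delta write at `a2` leaves the stale value at `a1` untouched, while the erase-then-write variant clears `a1` and stores the new value at `a2` in a single step.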
AI-generated summary · Google Gemini · from 2 sources.