Researchers have developed a Graph Memory Transformer (GMT) that replaces the standard Feed-Forward Network (FFN) sublayer in decoder-only language models with an explicit learned memory graph. The latest iteration, GMT v7, uses 128 centroids and a directed transition matrix within each of its 16 transformer blocks. While the 82.2M-parameter GMT model performs comparably to a larger GPT-style baseline on zero-shot benchmarks, it trails in validation loss and perplexity, suggesting room for further optimization and scaling.
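The summary gives only the headline numbers (128 centroids, a directed transition matrix, 16 blocks), so the following is a minimal PyTorch sketch of what such a memory-graph sublayer could look like in place of the FFN. The class name GraphMemorySublayer, the soft-assignment read, the single row-stochastic propagation step, and the output projection are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemorySublayer(nn.Module):
    """Hypothetical sketch of a GMT-style FFN replacement.

    Each token soft-assigns over a bank of learned centroids, a
    directed transition matrix routes that mass one step along the
    memory graph, and the routed mixture of centroids is read back
    into model space. Wiring is an assumption for illustration.
    """

    def __init__(self, d_model: int = 512, n_centroids: int = 128):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_centroids, d_model) * 0.02)
        # Logits for directed edges between centroids (the transition matrix).
        self.transition_logits = nn.Parameter(torch.zeros(n_centroids, n_centroids))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Soft-assign each token to centroids.
        scores = x @ self.centroids.T / self.centroids.shape[-1] ** 0.5
        assign = F.softmax(scores, dim=-1)                      # (B, S, K)
        # One propagation step along the directed memory graph.
        transition = F.softmax(self.transition_logits, dim=-1)  # row-stochastic (K, K)
        routed = assign @ transition                            # (B, S, K)
        # Read the routed centroid mixture back into model space.
        read = routed @ self.centroids                          # (B, S, d_model)
        return self.out_proj(read)
```

Under these assumptions, `GraphMemorySublayer(512, 128)(torch.randn(2, 16, 512))` returns a tensor of the input's shape, the drop-in property any FFN replacement must preserve; repeating the sublayer across 16 blocks would mirror the per-block memory graph the summary describes.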
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Explores an alternative to dense FFNs, potentially offering more interpretable and efficient transformer architectures.
RANK_REASON Academic paper introducing a novel transformer architecture variant.