Researchers analyzing the Gemma-2-2B model report that the residual stream in transformers, often likened to working memory, has a distinct temporal geometry. Information that persists across many tokens concentrates in a low-dimensional subspace rather than being spread diffusely across dimensions. These slow directions depend strongly on sequential order: shuffling the tokens drastically shortens their timescale.
AI IMPACT Reveals how transformers might encode temporal information, potentially guiding future model architectures and interpretability methods.
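To make "slow directions" and the shuffle control concrete, here is a toy sketch. It is not the paper's method: synthetic data stands in for Gemma-2-2B residual-stream activations, and the helper names `slow_directions` and `timescale` are hypothetical. It assumes slowness is measured by lag-1 autocorrelation along whitened directions.

```python
import numpy as np

def slow_directions(X, lag=1):
    """Return (autocorrelations, directions), slowest direction first.

    X is a (T, D) array with one row per token position.
    """
    Xc = X - X.mean(axis=0)
    # Whiten so that every direction has unit variance.
    cov = Xc.T @ Xc / len(Xc)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs / np.sqrt(evals)  # scales column j by 1/sqrt(evals[j])
    Z = Xc @ W
    # Symmetrized lagged covariance of the whitened activations.
    C = Z[:-lag].T @ Z[lag:] / (len(Z) - lag)
    C = (C + C.T) / 2
    rho, V = np.linalg.eigh(C)
    order = np.argsort(rho)[::-1]
    return rho[order], (W @ V)[:, order]

def timescale(rho, lag=1):
    """Decay time in tokens implied by an autocorrelation rho at the given lag."""
    return -lag / np.log(rho) if 0.0 < rho < 1.0 else 0.0

# Synthetic stand-in: one slowly varying latent embedded in white noise.
rng = np.random.default_rng(0)
T, D, a = 5000, 16, 0.98
s = np.zeros(T)
for t in range(1, T):
    s[t] = a * s[t - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
u = rng.standard_normal(D)
u /= np.linalg.norm(u)
X = rng.standard_normal((T, D)) + 3.0 * np.outer(s, u)

rho_ordered, _ = slow_directions(X)
# Shuffling token positions (rows) destroys the sequential structure.
rho_shuffled, _ = slow_directions(rng.permutation(X))

print("slowest timescale, ordered: ", timescale(rho_ordered[0]))
print("slowest timescale, shuffled:", timescale(rho_shuffled[0]))
```

In this toy setup only one of the 16 directions carries persistent signal, and shuffling the rows collapses its estimated timescale, which mirrors the qualitative pattern the summary describes.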
RANK_REASON The cluster contains a research paper detailing experimental findings on transformer model internals.
AI-generated summary · Google Gemini · from 1 source.