The Crystallization of Transformer Architectures (2017-2025)
A recent analysis of 53 large language models released between 2017 and 2025 reveals a significant convergence in transformer architectures. Key elements of this de facto standard include pre-normalization with RMSNorm, Rotary Position Embeddings (RoPE), SwiGLU activation functions in the MLP blocks, and shared key-value attention mechanisms (MQA/GQA). This convergence is attributed to improved optimization stability, better quality-per-FLOP, and practical considerations such as kernel availability and KV-cache economics.
AI IMPACT: Identifies a standardized set of architectural components that may guide future LLM development and optimization.
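Two of the converged components, RMSNorm and SwiGLU, are simple enough to sketch directly. The following is a minimal NumPy illustration of both, assuming toy dimensions and random weights chosen purely for demonstration (none of the names or sizes come from the analyzed models):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the activations.
    # Unlike LayerNorm, there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def silu(z):
    # SiLU (swish) activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP block: a SiLU-gated elementwise product of two
    # up-projections, followed by a down-projection.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Toy dimensions (illustrative only).
d_model, d_ff = 8, 16
rng = np.random.default_rng(0)

x = rng.standard_normal(d_model)
y = rms_norm(x, gamma=np.ones(d_model))          # pre-normalization
h = swiglu_mlp(y,
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
print(y.shape, h.shape)
```

After normalization, the mean squared activation is approximately 1, which is the stability property pre-norm architectures rely on.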