The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Researchers have investigated why Gated Linear Units (GLU) outperform non-GLU structures in large language models. Their analysis in the neural tangent kernel (NTK) regime indicates that GLU reshapes the NTK spectrum, yielding a smaller condition number and therefore faster convergence. Although GLU appears to accelerate optimization, empirical observations suggest it does little to reduce the generalization gap in models such as ViT and GPT-2.
AI IMPACT: Explains a key architectural advantage in LLMs, potentially guiding future model design for faster training.
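To make the link between condition number and convergence speed concrete, here is a minimal NumPy sketch. It is not the paper's experiment: the two spectra are hypothetical, chosen only for illustration, and the helper name `steps_to_converge` is an assumption. It relies on the standard fact that, in the linearized (NTK) regime, the training residual under gradient descent evolves as r_{t+1} = (I - eta K) r_t, so each eigendirection of the kernel K decays at a rate set by its eigenvalue.

```python
import numpy as np

def steps_to_converge(eigenvalues, tol=1e-6, max_steps=1_000_000):
    """Count gradient-descent steps until the relative residual norm drops below tol.

    In the kernel's eigenbasis, each residual component shrinks by the
    factor (1 - eta * lambda_i) per step.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    eta = 1.0 / lam.max()          # step size 1/lambda_max, a common stable choice
    residual = np.ones_like(lam)   # unit initial residual in every eigendirection
    initial_norm = np.linalg.norm(residual)
    for step in range(1, max_steps + 1):
        residual = (1.0 - eta * lam) * residual
        if np.linalg.norm(residual) / initial_norm < tol:
            return step
    return max_steps

# Hypothetical spectra with the same largest eigenvalue (illustrative numbers only).
ill_conditioned = np.geomspace(1e-3, 1.0, num=50)   # condition number about 1000
well_conditioned = np.geomspace(1e-1, 1.0, num=50)  # condition number about 10

for name, spectrum in [("ill-conditioned", ill_conditioned),
                       ("well-conditioned", well_conditioned)]:
    kappa = spectrum.max() / spectrum.min()
    print(f"{name}: kappa ~ {kappa:.0f}, steps = {steps_to_converge(spectrum)}")
```

With the step size fixed at 1/lambda_max, the slowest direction contracts by roughly (1 - 1/kappa) per step, so the spectrum with the smaller condition number needs far fewer steps. This is the sense in which a better-conditioned NTK would translate into faster training.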