Researchers have theoretically connected activation sparsity in Transformer MLPs to the flatness of their loss landscapes. They propose that this sparsity, which can reduce computational costs, is influenced by a ratio involving "augmented flatness" and input/gradient norms. The study also introduces "derivative sparsity" as a more stable alternative that aids pruning during backward propagation. Experiments on ImageNet-1K and C4 showed significant improvements in both training and inference sparsity compared with standard Transformers.
AI IMPACT Potential for significant reductions in AI model training and inference costs.
RANK_REASON Academic paper on theoretical AI concepts and empirical findings.
AI-generated summary · Google Gemini · from 1 source.
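
For readers unfamiliar with the term, the sketch below illustrates how activation sparsity in a single MLP hidden layer might be measured. It is a minimal illustration, not the paper's method: it assumes a ReLU activation, and it assumes "derivative sparsity" means the fraction of hidden units whose activation-function derivative is zero. The function names `pre_activations`, `activation_sparsity`, and `derivative_sparsity` are hypothetical and chosen only for this example.

```python
def pre_activations(x, weights, biases):
    """Compute W x + b for one input vector x (lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def activation_sparsity(pre):
    """Fraction of hidden units whose ReLU output is exactly zero."""
    acts = [max(0.0, z) for z in pre]
    return sum(a == 0.0 for a in acts) / len(acts)


def derivative_sparsity(pre):
    """Fraction of hidden units whose ReLU derivative is zero.

    Assumption: this is only an illustrative reading of "derivative
    sparsity"; the paper's exact definition may differ.
    """
    derivs = [1.0 if z > 0.0 else 0.0 for z in pre]
    return sum(d == 0.0 for d in derivs) / len(derivs)


if __name__ == "__main__":
    # Toy layer: 2 inputs, 4 hidden units, zero biases.
    x = [1.0, -2.0]
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
    biases = [0.0, 0.0, 0.0, 0.0]

    pre = pre_activations(x, weights, biases)
    print("activation sparsity:", activation_sparsity(pre))
    print("derivative sparsity:", derivative_sparsity(pre))
```

With ReLU the two measures coincide, since a unit outputs zero exactly when its derivative is zero. For smooth activations such as GELU they can differ, which is one plausible reason to track them separately.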