HORST: Composing Optimizer Geometries for Sparse Transformer Training
Researchers have developed HORST, a novel optimizer designed to improve the training of sparse transformers. Standard optimizers struggle to balance the need for sparsity with training stability. HORST addresses this by composing optimizer steps as non-commutative operators, integrating hyperbolic geometry to achieve both stability and L1 sparsity bias. Experiments show HORST significantly outperforms AdamW baselines, especially at higher sparsity levels, across vision and language tasks.
AI IMPACT: Enables more efficient training of sparse transformer models, potentially leading to smaller and faster AI systems.
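
To make the idea of non-commutative optimizer steps concrete, here is a minimal sketch in plain Python. It is not HORST's update rule, which the summary does not specify, and it leaves out the hyperbolic-geometry component entirely. It composes a Euclidean gradient step with the standard L1 soft-thresholding (proximal) step and shows that the order of composition changes the result. The function names, the quadratic loss, and all numeric values are assumptions chosen for illustration.

```python
# Illustrative only: a gradient step and an L1 soft-threshold step,
# composed in both orders. All names and values here are hypothetical
# and are not taken from HORST.

def gradient_step(w, target, lr):
    """One gradient step on the assumed loss 0.5 * ||w - target||^2."""
    return [wi - lr * (wi - ti) for wi, ti in zip(w, target)]

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrink each entry toward zero."""
    return [max(abs(wi) - lam, 0.0) * (1.0 if wi > 0 else -1.0) for wi in w]

w0 = [0.3, -0.05]      # starting weights (assumed)
target = [1.0, 0.0]    # minimizer of the assumed quadratic loss
lr, lam = 0.5, 0.1     # step size and L1 strength (assumed)

# Order A: gradient step first, then the sparsity step.
step_then_shrink = soft_threshold(gradient_step(w0, target, lr), lam)

# Order B: sparsity step first, then the gradient step.
shrink_then_step = gradient_step(soft_threshold(w0, lam), target, lr)

# The two compositions disagree, so the operators do not commute.
print(step_then_shrink, shrink_then_step)
```

Both orders drive the small second weight to zero, but they leave the first weight at different values, which is why the order in which such steps are composed is a design decision rather than a detail.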