SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
Researchers have introduced SiameseNorm, a two-stream architecture designed to resolve the long-standing conflict between Pre-Norm and Post-Norm in Transformer models. The approach couples a Pre-Norm stream and a Post-Norm stream within shared residual blocks, improving training stability and representational capacity without significant overhead. Experiments across model sizes and types, including dense language models, Vision Transformers, and Diffusion Transformers, show consistent performance gains and stable training.
AI IMPACT: Introduces an architecture that improves training stability and performance across several Transformer model types.
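
For readers unfamiliar with the two conventions, the sketch below contrasts a Pre-Norm update, x + f(LN(x)), with a Post-Norm update, LN(x + f(x)), and then runs both streams through the same sublayer at each depth. It is only an illustration under stated assumptions: the summary does not say how SiameseNorm couples the streams, so sharing one sublayer `f` between them is a guess at what "shared residual blocks" means, and every name here (`layer_norm`, `make_sublayer`, `two_stream_step`) is hypothetical rather than taken from the authors' code.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def make_sublayer(dim, seed):
    """Stand-in for an attention or MLP sublayer: a fixed random linear map plus tanh."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: np.tanh(x @ w)


def pre_norm_step(x, f):
    # Pre-Norm: normalize the sublayer input, leave the residual path untouched.
    return x + f(layer_norm(x))


def post_norm_step(x, f):
    # Post-Norm: normalize after the residual addition.
    return layer_norm(x + f(x))


def two_stream_step(pre, post, f):
    # ASSUMPTION: both streams reuse the same sublayer f (shared weights).
    # The real SiameseNorm coupling may differ from this.
    return pre_norm_step(pre, f), post_norm_step(post, f)


dim, tokens, depth = 16, 4, 8
x0 = np.random.default_rng(0).normal(size=(tokens, dim))
pre, post = x0.copy(), x0.copy()

for layer in range(depth):
    f = make_sublayer(dim, seed=layer + 1)
    pre, post = two_stream_step(pre, post, f)

# The Pre-Norm residual stream is never normalized, so its scale is free to drift
# with depth; the Post-Norm stream is re-normalized after every block.
print("pre-norm stream RMS: ", float(np.sqrt((pre ** 2).mean())))
print("post-norm stream RMS:", float(np.sqrt((post ** 2).mean())))
```

The toy sublayer has no trainable parameters, so the sketch says nothing about the stability or accuracy results reported in the summary; it only shows where the normalization sits in each stream.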