Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 3d

Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

Researchers have identified a phenomenon they call "weight drift," in which optimization inadvertently pushes neural network weights toward negative values. The drift is independent of the training data and occurs with standard loss functions and common activation functions such as ReLU and GELU. The study shows that it can produce substantial activation sparsity, which may affect model accuracy, and can amplify activation spikes in transformer layers.
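To make the link between negative weights and activation sparsity concrete, here is a toy sketch. It is not the paper's experiment. It assumes a single random linear layer followed by ReLU, with non-negative inputs (as if they came from a previous ReLU layer). The function name `relu_sparsity` and the parameter `weight_shift` are hypothetical, introduced only for illustration. Shifting the mean of the weights below zero pushes more pre-activations below zero, so a larger fraction of ReLU outputs is exactly zero.

```python
import random


def relu_sparsity(weight_shift, n_inputs=64, n_units=200, n_samples=50, seed=0):
    """Return the fraction of ReLU outputs that are exactly zero.

    Toy setup (an assumption, not the paper's protocol): weights are drawn
    from a Gaussian with mean `weight_shift` and std 1/sqrt(n_inputs).
    """
    rng = random.Random(seed)
    std = 1.0 / n_inputs ** 0.5
    weights = [
        [rng.gauss(weight_shift, std) for _ in range(n_inputs)]
        for _ in range(n_units)
    ]
    zeros = 0
    total = 0
    for _ in range(n_samples):
        # Non-negative inputs, as if produced by an earlier ReLU layer.
        x = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_inputs)]
        for w in weights:
            pre_activation = sum(wi * xi for wi, xi in zip(w, x))
            # ReLU outputs zero whenever the pre-activation is not positive.
            if pre_activation <= 0.0:
                zeros += 1
            total += 1
    return zeros / total


if __name__ == "__main__":
    for shift in (0.0, -0.02, -0.05):
        print(f"mean weight shift {shift:+.2f}: sparsity {relu_sparsity(shift):.3f}")
```

In this toy setting a small negative shift in the mean weight is enough to silence most units, because every input is non-negative and the shift accumulates across all of them. Real networks have biases, normalization layers and signed inputs, so the effect there need not be as strong.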

IMPACT Identifies a fundamental training dynamic that could impact model performance and efficiency across various architectures.

ViT
ReLU
GELU
ResNet
MP-SENe
GPT-nano
Egor Shvetsov