Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
By PulseAugur Editorial
Summary from 11 sources
Researchers are exploring the fundamental mechanisms behind transformer attention, with new papers analyzing its gradient-flow structure and training dynamics. One study interprets attention as a gradient flow on the unit sphere, identifying factors that influence token clustering and stability in multi-head settings. Another paper investigates the critical training windows for complexity control, determining when transformers prioritize reasoning over memorization. Additional research traces the origins of geometric continuity in deep neural networks to residual connections and symmetry-breaking nonlinearities, and examines the structural causes of the "attention sink" phenomenon.
AI IMPACT
These theoretical analyses offer deeper insights into transformer behavior, potentially guiding future architectural improvements and training strategies for more efficient and capable models.
RANK REASON
Multiple arXiv papers were published on theoretical aspects of transformer attention mechanisms and training dynamics.
arXiv:2605.06611v1 Announce Type: new Abstract: Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this phenomenon. First, w…
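For a concrete picture of the phenomenon, the sketch below measures how much causal-softmax attention mass lands on the first token. It uses plain NumPy with random query/key matrices as stand-ins for real model activations, so it is only a diagnostic for the sink, not the paper's mechanistic analysis.

```python
import numpy as np

def sink_mass(Q, K, first_k=1):
    """Average causal-softmax attention mass landing on the first `first_k` tokens.

    Diagnostic sketch only; the paper's structural analysis of the sink is more involved.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                                # (T, T) attention logits
    scores[np.triu(np.ones((T, T), dtype=bool), 1)] = -np.inf   # causal mask: no future tokens
    scores -= scores.max(axis=-1, keepdims=True)                 # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn[:, :first_k].sum(axis=-1).mean()

# Random stand-ins for trained-model queries/keys; a trained LLM exhibiting a sink
# would concentrate far more mass on token 0 than this chance-level value.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(16, 32)), rng.normal(size=(16, 32))
print(f"attention mass on token 0: {sink_mass(Q, K):.3f}")
```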
arXiv:2605.04279v1 Announce Type: new Abstract: Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for sin…
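The interacting-particle picture in this abstract can be simulated directly. The sketch below discretizes single-head attention dynamics for tokens constrained to the unit sphere; the inverse temperature, step size, and step count are illustrative choices, and the paper's multi-head setting adds structure this toy omits.

```python
import numpy as np

def sphere_attention_flow(X, beta=4.0, dt=0.05, steps=500):
    """Discretized single-head attention dynamics on the unit sphere.

    Each token moves toward the softmax-weighted average of the others,
    projected onto the tangent space and renormalized so it stays on the
    sphere. A minimal sketch of the interacting-particle picture; the paper's
    multi-head analysis involves per-head projections this toy omits.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(steps):
        logits = beta * X @ X.T                                   # softmax interaction potential
        W = np.exp(logits - logits.max(axis=1, keepdims=True))
        W /= W.sum(axis=1, keepdims=True)
        drift = W @ X                                             # attention-weighted mean field
        drift -= np.sum(drift * X, axis=1, keepdims=True) * X    # project onto tangent space
        X = X + dt * drift
        X /= np.linalg.norm(X, axis=1, keepdims=True)             # retract onto the sphere
    return X

rng = np.random.default_rng(1)
X = sphere_attention_flow(rng.normal(size=(32, 3)))
# Mean pairwise cosine similarity approaching 1 indicates token clustering.
print("mean pairwise cosine after flow:", np.round((X @ X.T).mean(), 3))
```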
arXiv:2605.04396v1 Announce Type: new Abstract: Recent work has shown that Transformers' compositional generalization is governed by \emph{complexity control} (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-…
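The two knobs named here are ordinary hyperparameters. A hypothetical PyTorch setup, with illustrative values rather than the paper's settings, might look like the following.

```python
import torch
import torch.nn as nn

# The two "complexity control" knobs from the abstract, wired into a toy
# transformer layer: a small initialization scale and explicit weight decay.
# The values (0.1, 1e-2) are illustrative, not taken from the paper.
init_scale, weight_decay = 0.1, 1e-2

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
with torch.no_grad():
    for p in layer.parameters():
        if p.dim() > 1:
            p.mul_(init_scale)   # shrink all weight matrices toward zero

optimizer = torch.optim.AdamW(layer.parameters(), lr=3e-4, weight_decay=weight_decay)
```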
arXiv:2605.04971v1 Announce Type: new Abstract: Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we ide…
arXiv cs.LG
TIER_1 · Zheng-An Chen, Pengxiao Lin, Zhi-Qin John Xu, Tao Luo
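Geometric continuity as described in that abstract is simple to measure: compare the principal output direction of one layer's weight matrix with the principal input direction of the next. The sketch below computes this for random (untrained) matrices, which gives the chance-level baseline; the effect reported in the paper appears in trained networks.

```python
import numpy as np

def adjacent_layer_alignment(W_prev, W_next):
    """|cosine| between the principal output direction of one layer and the
    principal input direction of the next.

    Geometric continuity shows up as this value sitting well above the
    random-direction baseline ~1/sqrt(d). Measurement sketch only; the paper's
    experiments are on trained MLPs and small transformers.
    """
    U_prev, _, _ = np.linalg.svd(W_prev)   # columns: output (left) singular vectors
    _, _, Vt_next = np.linalg.svd(W_next)  # rows: input (right) singular vectors
    return abs(U_prev[:, 0] @ Vt_next[0])

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(f"random baseline alignment: {adjacent_layer_alignment(W1, W2):.3f}")
```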
arXiv:2605.01199v1 Announce Type: new Abstract: Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention lear…
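The snippet does not say how focus and dilution are quantified; one plausible proxy, sketched below, is the mean row entropy of an attention map tracked across training checkpoints (low entropy indicating focus, high entropy indicating dilution). This metric is an assumption for illustration, not necessarily the paper's.

```python
import numpy as np

def attention_entropy(attn):
    """Mean row entropy of an attention map (rows sum to 1).

    Assumed proxy for the focus-dilution cycle: low values mean attention is
    concentrated (focus), high values mean it is spread out (dilution).
    """
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

# Uniform attention over 16 tokens is maximally diluted: entropy = log(16) ~ 2.77.
print(attention_entropy(np.full((16, 16), 1.0 / 16)))
```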
arXiv:2603.13381v2 Announce Type: replace Abstract: Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X…
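The algebraic point is easy to verify numerically: attention logits depend on the query and key projections only through the product $W_Q W_K^T$, so a square $W_Q$ can be folded into the keys and replaced by the identity. The check below uses random matrices and is a sketch of that observation, not the paper's broader case for nonlinear query maps.

```python
import numpy as np

# Attention logits depend on W_Q and W_K only through the product W_Q W_K^T,
# so a (square) query projection can be absorbed into the key projection and
# W_Q set to the identity. Minimal numerical check with random matrices.
rng = np.random.default_rng(3)
d = 32
X = rng.normal(size=(10, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

logits_original = (X @ W_Q) @ (X @ W_K).T       # X W_Q W_K^T X^T
W_K_folded = W_K @ W_Q.T                         # absorb W_Q into the keys
logits_identity_query = X @ (X @ W_K_folded).T   # same logits with W_Q = I

print(np.allclose(logits_original, logits_identity_query))  # True
```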
arXiv:2604.24878v1 Announce Type: cross Abstract: We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyond universal approximation statements. We showcase …
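One standard intuition behind such translations, shown below as a toy, is that a saturated two-slot softmax read-out already emulates ReLU: scores [beta·x, 0] over values [x, 0] converge to max(x, 0) as beta grows. This is a generic construction for illustration, not the paper's recipe or its resource bounds.

```python
import numpy as np

def softmax_relu(x, beta=50.0):
    """Approximate ReLU(x) with a two-slot softmax 'attention' read-out.

    Softmax over scores [beta*x, 0] applied to values [x, 0]: as beta grows,
    the weighted sum approaches max(x, 0). A generic intuition for how softmax
    attention can emulate piecewise-linear computation; the paper's recipe and
    resource bounds are more general than this toy.
    """
    w = np.exp(beta * x) / (np.exp(beta * x) + 1.0)  # softmax weight on the x-slot
    return w * x                                      # weighted value read-out

xs = np.linspace(-2, 2, 9)
print(np.round(softmax_relu(xs), 3))  # close to [0, 0, 0, 0, 0, 0.5, 1, 1.5, 2]
```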