Transformer models gain absolute position awareness from causal mask and residual stream

By PulseAugur Editorial · [2 sources] · 2026-06-04 13:32

Researchers have identified two architectural components in decoder-only Transformers that let the model distinguish absolute position, even though positional encodings such as RoPE encode only relative offsets in the attention inner product. The first is the causal mask, whose per-query softmax denominator depends on the query position. The second is the residual stream, which acts as a dynamical system at position 0. The study also analyzes how architectural choices such as NTK scaling and sliding-window attention interact with these two components to shape the model's positional awareness.
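The two facts this summary rests on can be checked with a small sketch. It is an illustration of the general mechanism, not the paper's analysis: the helper names (`rope_rotate`, `causal_softmax_row`), the toy vectors, and the base of 10000 are all assumptions chosen for the example. It shows that a RoPE attention logit is unchanged when query and key are shifted by the same amount, while the causal softmax normalizes over a number of keys that grows with the query position.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by an angle proportional to `pos`.

    Hypothetical helper: a plain RoPE-style rotation, assuming an even-length vector.
    """
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)  # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_softmax_row(scores, query_pos):
    """Softmax over keys 0..query_pos only; the causal mask hides later keys."""
    visible = scores[: query_pos + 1]
    m = max(visible)
    exps = [math.exp(s - m) for s in visible]
    denom = sum(exps)  # sums query_pos + 1 terms, so it depends on the query position
    return [e / denom for e in exps]

# Toy query and key vectors (arbitrary values, for illustration only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# 1) Same relative offset (3) at two different absolute positions:
#    the RoPE logits agree up to floating-point error.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# 2) With identical logits everywhere, the weight on key 0 is 1 / (i + 1)
#    for query position i, so the attention row reveals the absolute position.
weight_on_first_key = [causal_softmax_row([0.0] * 8, i)[0] for i in range(8)]

print(score_near, score_far)
print(weight_on_first_key)
```

In this toy setting the logits carry no absolute information at all, yet the normalized attention weights still differ by query position, which is the kind of leakage the summary attributes to the causal mask.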

IMPACT Reveals how architectural choices enable absolute position understanding in LLMs, potentially guiding future model design.

RANK_REASON The cluster contains an academic paper detailing novel research findings on Transformer architecture.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri · 2026-06-05 04:00

Where does Absolute Position come from in decoder-only Transformers?

arXiv:2606.06160v1. RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is re…

arXiv cs.AI TIER_1 English(EN) · Fabrizio Silvestri · 2026-06-04 13:32

Where does Absolute Position come from in decoder-only Transformers?

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax den…
