arXiv:2605.05697v1 Announce Type: new Abstract: Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a re…
arXiv:2603.11161v2 Announce Type: replace Abstract: We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion …
arXiv cs.LG
English (EN) · Lena Ehrmuth, Laura Strieker
arXiv:2605.04683v1 Announce Type: cross Abstract: We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. …
arXiv:2603.17771v2 Announce Type: replace Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers, large residual-stream norms…
arXiv:2605.03110v1 Announce Type: new Abstract: A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects $r \l…
arXiv cs.CL
English (EN) · Jiaoda Li, Ryan Cotterell
arXiv:2605.00768v1 Announce Type: new Abstract: The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generat…
<h2> Decoder-Only Transformers </h2> <p>In this article, we will explore <strong>decoder-only transformers</strong>.</p> <p>Decoder-only transformers are a specific type of transformer architecture used in systems like ChatGPT.</p> <h2> Masked Self-Attention </h2> <p>Decoder-only…