How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
Researchers have investigated the linearity of Transformer feed-forward networks (FFNs), finding that the degree to which an FFN block is linear is a learned property rather than an architectural one. By measuring the linear recoverability (R^2_lin) across different transformer models like GPT-2, Pythia-160m, and llama-160m, they observed significant variation between adjacent blocks. This measurement also serves as a compression signal, indicating which blocks can be safely replaced with smaller, single-layer approximations. AI
IMPACT Provides insights into the internal workings of transformer models, potentially informing future architectural designs and compression techniques.