Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
A recent study re-evaluated the effectiveness of Transformer model modifications, finding that most still do not yield significant improvements when scaled to 1-3 billion parameters. Researchers tested 20 modifications introduced after 2021, using downstream evaluation metrics and controlling for variables like data, compute, and training recipes. The findings largely echo a 2021 study, with only a couple of modifications showing benefits, and one of those proving unstable at the larger scale. The research emphasizes the need for rigorous reporting, downstream evaluation, and cross-scale stability testing for architecture comparisons.
AI IMPACT: Confirms that architectural innovations in large language models often fail to scale effectively, suggesting a need for more robust evaluation methods.