Scaling depth capacity via zero/one-layer model expansion
Two new research papers explore ways to make large language models more efficient by rethinking how they use depth. The first introduces "zero/one-layer progressive training," which is reported to save up to 80% of compute on models like GPT-2 and to show substantial efficiency gains on Llama3 and DeepSeekV3. The second argues that LLM performance scales inversely with depth because many layers are functionally similar, and it proposes architectural innovations that encourage more compositional use of depth.
AI IMPACT: These studies offer potential pathways to reduce training costs and accelerate LLM development, particularly at larger scales.
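
The summary does not describe how the papers expand a model's depth, so the sketch below is only a generic illustration of the idea of growing a shallow network into a deeper one without disturbing what it has already learned. It assumes residual blocks whose output projection is zero-initialized when they are added, which is a common function-preserving trick and not necessarily the method used in either paper. All names (`make_block`, `grow`, `forward`) are hypothetical.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def make_block(dim, rng, zero_output=False):
    """Parameters for one residual block: x + W_out @ relu(W_in @ x)."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    if zero_output:
        # Assumption: new blocks start with a zero output projection,
        # so they initially act as the identity function.
        w_out = [[0.0] * dim for _ in range(dim)]
    else:
        w_out = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return w_in, w_out


def forward(blocks, x):
    """Run x through a stack of residual blocks."""
    for w_in, w_out in blocks:
        update = matvec(w_out, relu(matvec(w_in, x)))
        x = [a + b for a, b in zip(x, update)]
    return x


def grow(blocks, extra, dim, rng):
    """Return a deeper stack: the old blocks plus `extra` identity-initialized ones."""
    return blocks + [make_block(dim, rng, zero_output=True) for _ in range(extra)]


rng = random.Random(0)
dim = 4
shallow = [make_block(dim, rng)]                 # start training with one layer
deep = grow(shallow, extra=3, dim=dim, rng=rng)  # later expand to four layers
x = [1.0, -2.0, 0.5, 3.0]
```

In this toy setup the expanded stack produces the same output as the shallow one at the moment of expansion, so further training of the deeper model would continue from the shallow model's behavior instead of starting over.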