Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.LG English(EN) · 11h · [2 sources]

Scaling depth capacity via zero/one-layer model expansion

Two new research papers explore ways to make large language models more efficient by rethinking how depth is used. The first introduces "zero/one-layer progressive training," which reportedly saves up to 80% of compute on models like GPT-2 and shows substantial efficiency gains on Llama3 and DeepSeekV3. The second suggests that LLM performance scales inversely with depth because many layers are functionally similar, and proposes architectural innovations that encourage more compositional use of depth.

IMPACT These studies offer potential pathways to reduce training costs and accelerate LLM development, particularly at larger scales.
- DeepSeekV3
- GPT-2
- Yizhou Liu
- Zhiqi Bu

TOOL · arXiv stat.ML English(EN) · 11h

Universal One-third Time Scaling in Learning Peaked Distributions

Researchers have identified a universal one-third time scaling in the learning process of peaked probability distributions, a phenomenon observed in large language models. This behavior, stemming from the use of softmax and cross-entropy, creates a fundamental optimization bottleneck leading to slow power-law convergence of the loss and gradients. The findings offer a mechanistic explanation for observed neural scaling and suggest avenues for enhancing LLM training efficiency.

IMPACT Explains a fundamental bottleneck in LLM training, potentially guiding efforts to improve efficiency.
- Yizhou Liu
- LLMs
