Researchers have developed Asteria, a runtime system that separates second-order optimization logic from the GPU training path to make LLM training more scalable. Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and storage, and prepares shadow states asynchronously. Separately, two fluid-guided online scheduling approaches, WAIT and Nested WAIT, have been introduced to optimize LLM inference by managing the KV cache, improving latency and cost-efficiency, especially under heavy load. Together, these works aim to make complex optimization methods practical for LLM training and inference.
Summary written by gemini-2.5-flash-lite from 4 sources.
IMPACT These systems offer potential improvements in the efficiency and cost-effectiveness of both training and deploying large language models.
RANK_REASON The cluster contains two research papers detailing novel systems for optimizing LLM training and inference.