PulseAugur
research · [4 sources] ·

New systems optimize LLM training and inference efficiency

Researchers have developed Asteria, a runtime system that separates second-order optimization logic from the GPU training path to make LLM training more scalable. The system dynamically distributes optimizer state across GPU memory, CPU memory, and storage, and prepares shadow states asynchronously. Separately, two fluid-guided online scheduling approaches, WAIT and Nested WAIT, have been introduced for LLM inference; they manage the KV cache under memory constraints to improve latency and cost-efficiency, especially under heavy load. The first paper aims to make second-order optimization practical for LLM training, and the second aims to make GPU scheduling for LLM inference more efficient.

Summary written by gemini-2.5-flash-lite from 4 sources.

IMPACT These systems offer potential improvements in the efficiency and cost-effectiveness of both training and deploying large language models.

RANK_REASON The cluster contains two research papers detailing novel systems for optimizing LLM training and inference.


COVERAGE [4]

  1. arXiv cs.LG TIER_1 · Wes Armour ·

    Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

    Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remov…

  2. arXiv stat.ML TIER_1 · Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang ·

    Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

    arXiv:2504.11320v3. Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and co…

  3. Medium — fine-tuning tag TIER_1 · QuarkAndCode ·

    Why Pretrained LLMs Need Fine-Tuning for Better AI Performance

    https://medium.com/@QuarkAndCode/why-pretrained-llms-need-fine-tuning-for-better-ai-performance-6541293f9fef?source=rss------fine_tuning-5

  4. Medium — MLOps tag TIER_1 · Charan Panthangi ·

    Inference Optimization — How to Make LLMs Faster and Cheaper in Production

    https://medium.com/@charan.panthangi/inference-optimization-how-to-make-llms-faster-and-cheaper-in-production-2778cd00d921?source=rss------mlops-5