Brief

last 24h

[2/2] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · Modal blog English(EN) · 3d

How we achieved truly serverless GPUs

Modal describes how it built truly serverless GPUs for AI inference, where resources must scale rapidly to meet variable demand. The approach combines cloud buffers of idle GPUs, a custom filesystem that serves container images lazily, and checkpoint/restore for both CPU and GPU processes. Developed over five years, the system cuts scaling time for AI inference replicas from tens of minutes to seconds, with the goal of maximizing GPU Allocation Utilization.

IMPACT Enables faster, more efficient scaling of AI inference workloads, potentially lowering costs and improving resource utilization.
- xAI
- AWS
- Modal
- SGLang
- Marc Brooker
- AI inference
RESEARCH · arXiv cs.CL Deutsch(DE) · 4d · [3 sources]

FastKernels: Benchmarking GPU Kernel Generation in Production

Researchers have introduced FastKernels, a benchmark for evaluating GPU kernel generation agents used in production LLM inference. Existing benchmarks are misaligned with real-world systems, so agents produce kernels that perform poorly outside test environments. FastKernels aims to close this gap by serving as a production-grade inference framework that mirrors real-world deployment needs and covers the vast majority of HuggingFace Transformers architectures.

IMPACT Addresses a critical bottleneck in LLM inference by improving the alignment of GPU kernel generation benchmarks with production systems.
- GPU kernel generation
- SGLang
- vLLM
- AI inference
- FastKernels
- GPU
- LLM
