Hearth: scale-to-zero LLM serving on Kubernetes — and you can hack on it without a GPU
New Kubernetes operators are emerging to address the cost of running large language models, in particular the money burned by idle GPUs. Hearth, an alpha-stage operator, lets users serve open-source LLMs declaratively and scale them to zero when not in use, buffering incoming requests during cold starts. A separate approach builds a KEDA external scaler on top of NVML, so that autoscaling follows actual GPU utilization without requiring a full metrics stack such as Prometheus.
IMPACT: Enables cost-effective self-hosting of LLMs by reducing idle GPU expenditure.
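
To make "buffering requests during cold starts" concrete, here is a minimal Python sketch of the general idea. It is not taken from Hearth, whose implementation is not described here. The class `ColdStartGate`, its methods, and the `start_backend` callback are all hypothetical names chosen for illustration. The first request to arrive while the backend is at zero replicas triggers a single scale-up, and every request that arrives in the meantime waits on the same readiness signal instead of failing.

```python
import asyncio


class ColdStartGate:
    """Illustrative only: parks requests while a scaled-to-zero backend starts."""

    def __init__(self, start_backend, max_waiters=100, timeout_s=30.0):
        # start_backend is a hypothetical coroutine function that scales the
        # backend from 0 to 1 replica and returns once it can serve traffic.
        self._start_backend = start_backend
        self._max_waiters = max_waiters
        self._timeout_s = timeout_s
        self._ready = asyncio.Event()
        self._starting = False
        self._waiters = 0
        self._task = None

    async def _warm_up(self):
        try:
            await self._start_backend()
            self._ready.set()  # release every buffered request at once
        finally:
            self._starting = False

    async def wait_until_ready(self):
        """Return immediately if warm; otherwise trigger one start and wait."""
        if self._ready.is_set():
            return
        if self._waiters >= self._max_waiters:
            raise RuntimeError("cold-start buffer full")
        if not self._starting:
            # Only the first waiter kicks off the scale-up.
            self._starting = True
            self._task = asyncio.create_task(self._warm_up())
        self._waiters += 1
        try:
            # Raises asyncio.TimeoutError if the backend never becomes ready.
            await asyncio.wait_for(self._ready.wait(), self._timeout_s)
        finally:
            self._waiters -= 1

    def mark_scaled_to_zero(self):
        """Call after scaling down so the next request triggers a new start."""
        self._ready.clear()


async def demo():
    starts = 0

    async def fake_start():
        nonlocal starts
        starts += 1
        await asyncio.sleep(0.05)  # stand-in for pod scheduling and model load

    gate = ColdStartGate(fake_start)

    async def handle(i):
        await gate.wait_until_ready()
        return f"response {i}"

    # Five requests arrive together while the backend is at zero replicas.
    results = await asyncio.gather(*(handle(i) for i in range(5)))
    return starts, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, five concurrent requests cause exactly one backend start and all five complete once it is ready. The waiter cap and the timeout are the two knobs that keep a slow model load from turning into an unbounded queue. A real proxy would also need to surface start-up failures to waiting clients rather than letting them time out.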