Brief · PulseAugur

TOOL · dev.to — LLM tag English(EN) · 4h

Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run

This article details a method for deploying Hugging Face language models on Google Cloud Run using serverless GPUs. It outlines a streamlined process involving a Makefile, Dockerfile, and Terraform scripts to automate the build, provisioning, and deployment of models like Qwen/Qwen3.5-4B. The approach focuses on baking model weights into the Docker image at build time, ensuring no runtime downloads and enabling efficient, self-contained deployments on NVIDIA L4 GPUs with an OpenAI-compatible API. AI

IMPACT Enables efficient, cost-effective deployment of LLMs for developers without deep infrastructure expertise.

Hugging Face
Google Cloud
Terraform
vLLM
NVIDIA L4 GPU
Google Cloud Run
Makefile
Dockerfile
Artifact Registry
Qwen/Qwen3.5-4B