Deploy Hugging Face LLMs on Google Cloud Run with Serverless GPUs

By PulseAugur Editorial · [1 sources] · 2026-06-15 12:53

This article details a method for deploying Hugging Face language models on Google Cloud Run using serverless GPUs. It outlines a streamlined process involving a Makefile, Dockerfile, and Terraform scripts to automate the build, provisioning, and deployment of models like Qwen/Qwen3.5-4B. The approach focuses on baking model weights into the Docker image at build time, ensuring no runtime downloads and enabling efficient, self-contained deployments on NVIDIA L4 GPUs with an OpenAI-compatible API. AI

IMPACT Enables efficient, cost-effective deployment of LLMs for developers without deep infrastructure expertise.

RANK_REASON The article describes a method for deploying existing models on a cloud platform, which is a tooling-related use case.

Read on dev.to — LLM tag →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

Deploy Hugging Face LLMs on Google Cloud Run with Serverless GPUs

COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Boris Barac · 2026-06-15 12:53

Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run

<div class="highlight js-code-highlight"> <pre class="highlight shell"><code>curl https://vllm-endpoint-xxxxx-ew4.a.run.app/v1/chat/completions \ -H "Content-Type: application/json" \ …

COVERAGE [1]

Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run

RELATED ENTITIES

RELATED TOPICS