PulseAugur
EN
LIVE 14:56:45

Deploy Hugging Face LLMs on Google Cloud Run with Serverless GPUs

This article details a method for deploying Hugging Face language models on Google Cloud Run using serverless GPUs. It outlines a streamlined process involving a Makefile, Dockerfile, and Terraform scripts to automate the build, provisioning, and deployment of models like Qwen/Qwen3.5-4B. The approach focuses on baking model weights into the Docker image at build time, ensuring no runtime downloads and enabling efficient, self-contained deployments on NVIDIA L4 GPUs with an OpenAI-compatible API. AI

IMPACT Enables efficient, cost-effective deployment of LLMs for developers without deep infrastructure expertise.

RANK_REASON The article describes a method for deploying existing models on a cloud platform, which is a tooling-related use case.

Read on dev.to — LLM tag →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. dev.to — LLM tag TIER_1 English(EN) · Boris Barac ·

    Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run

    <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>curl https://vllm-endpoint-xxxxx-ew4.a.run.app/v1/chat/completions <span class="se">\</span> <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span> …