Running OpenAI’s gpt-oss-20b with 128k Context on a Single L4 GPU
An engineer has deployed OpenAI's gpt-oss-20b model with a 128,000-token context window on a single NVIDIA L4 GPU. The setup, which has been running in production for six months, stores the weights in MXFP4 quantization and keeps the KV cache in FP8, so that both the model and the cache fit within the GPU's 24 GB of VRAM. The model natively supports OpenAI's tool-calling format and performs internal chain-of-thought reasoning, which makes it better suited to complex analytical tasks.
AI IMPACT: Demonstrates efficient deployment of a long-context model on accessible hardware, potentially lowering barriers for complex AI applications.
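The memory claim in the summary can be sanity-checked with rough arithmetic. The Python sketch below estimates the KV cache size at 128,000 tokens for FP8 (1 byte per element) versus FP16 (2 bytes per element). Everything numeric in it other than the context length and the 24 GB figure is an assumption not stated in the source: the layer count, KV-head count, and head dimension are what the published model configuration is understood to specify and should be verified against it, the weight footprint is a placeholder, and 24 GB is treated as 24 GiB. The estimate also assumes every layer caches the full context, so it is an upper bound if some layers use sliding-window attention.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: one key vector and one value vector per KV head, per layer, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def headroom_bytes(vram_bytes, weight_bytes, cache_bytes):
    """VRAM left over after weights and KV cache (activations not counted)."""
    return vram_bytes - weight_bytes - cache_bytes


# ASSUMPTION: architecture values, to be checked against the model's config file.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
TOKENS = 128_000  # context length from the summary

fp8_cache = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
fp16_cache = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)

# ASSUMPTION: placeholder weight footprint for the MXFP4 checkpoint.
ASSUMED_WEIGHT_BYTES = 13 * GIB
# ASSUMPTION: the card's "24 GB" is treated as 24 GiB; usable memory is lower in practice.
VRAM_BYTES = 24 * GIB

print(f"FP8  KV cache: {fp8_cache / GIB:.2f} GiB")
print(f"FP16 KV cache: {fp16_cache / GIB:.2f} GiB")
print(f"Headroom with FP8 cache:  "
      f"{headroom_bytes(VRAM_BYTES, ASSUMED_WEIGHT_BYTES, fp8_cache) / GIB:.2f} GiB")
print(f"Headroom with FP16 cache: "
      f"{headroom_bytes(VRAM_BYTES, ASSUMED_WEIGHT_BYTES, fp16_cache) / GIB:.2f} GiB")
```

Under these assumptions the FP8 cache takes half the memory of an FP16 cache at the same context length, which is the margin the summary credits for fitting everything on one card. The headroom figures ignore activations, framework overhead, and memory reserved by the driver, so they indicate feasibility rather than a guarantee.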