Brief · PulseAugur

TOOL · dev.to — LLM tag English(EN) · 5h

Why Self-Hosted Claude Code Was 15x Slower Than It Should Be

A developer described how they sped up their self-hosted Claude Code setup by fixing two performance bottlenecks. The main one was a rotating billing header injected by Claude Code: because it changed between requests, it caused cache misses on the vLLM-MLX backend. Second, vLLM-MLX's SimpleEngine kept no persistent KV state for system prefixes, so caching them required a custom patch. Together, these changes cut turn times from over 100 seconds to 7-8 seconds, a 13-15x improvement.
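To illustrate why a rotating header defeats prefix caching, here is a minimal sketch in Python. It is not the developer's patch and does not use any vLLM-MLX API. The header name `x-billing-token`, the `PrefixCache` class, and the `fake_prefill` function are all hypothetical stand-ins, since the brief gives neither the real header format nor the patch itself. The sketch assumes a cache keyed on a hash of the system prompt, and shows that stripping the per-request line lets later turns reuse the cached entry.

```python
import hashlib
import re

# Hypothetical header format; the brief does not give the real one.
BILLING_LINE = re.compile(r"^x-billing-token: .*\n?", re.MULTILINE)


def normalize_prefix(system_prompt: str) -> str:
    """Drop the per-request billing line so identical prompts compare equal."""
    return BILLING_LINE.sub("", system_prompt)


def prefix_key(system_prompt: str) -> str:
    """Hash the prompt text; any changed byte produces a different key."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


class PrefixCache:
    """Toy stand-in for persistent KV state: prefix key -> cached result."""

    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, system_prompt, compute):
        key = prefix_key(system_prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(system_prompt)
        return self._store[key]


def fake_prefill(prompt: str) -> int:
    """Placeholder for the expensive prefill that a real engine would run."""
    return len(prompt)


def run_turns(normalize: bool, turns: int = 3) -> PrefixCache:
    """Simulate several turns whose only difference is the rotating header."""
    cache = PrefixCache()
    for i in range(turns):
        prompt = f"x-billing-token: {i:04d}\nYou are a coding assistant.\n"
        if normalize:
            prompt = normalize_prefix(prompt)
        cache.get_or_compute(prompt, fake_prefill)
    return cache


if __name__ == "__main__":
    raw = run_turns(normalize=False)
    clean = run_turns(normalize=True)
    print("raw:  ", raw.hits, "hits,", raw.misses, "misses")
    print("clean:", clean.hits, "hits,", clean.misses, "misses")
```

With the rotating line left in place, every turn hashes to a new key and pays for prefill again. Once the line is stripped, only the first turn misses.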

IMPACT · Optimizations like these are crucial for making self-hosted LLM deployments practical and cost-effective for developers.

Anthropic
Claude Code
vLLM-MLX
Qwen2.5-Coder-32B-Instruct-8bit