Why Self-Hosted Claude Code Was 15x Slower Than It Should Be
A developer detailed how they sped up their self-hosted Claude Code setup by fixing two performance bottlenecks. The primary issue was a rotating billing header injected by Claude Code, which caused cache misses on the vLLM-MLX backend. In addition, vLLM-MLX's SimpleEngine kept no persistent KV state for system prefixes, so caching them required a custom patch. Together, the changes cut turn times from over 100 seconds to 7-8 seconds, a 13-15x improvement.
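To illustrate why a rotating value in the prompt prefix defeats a prefix cache, here is a minimal Python sketch. It is not the developer's patch and does not use vLLM-MLX's API: the header name `x-billing-id`, its placement as the first line of the system prompt, and the `PrefixCache` class are all hypothetical stand-ins chosen for the example.

```python
import hashlib
import re

# Hypothetical header format; the source does not give the real header's name or layout.
BILLING_LINE = re.compile(r"^x-billing-id: .*\n", re.MULTILINE)


def prefix_key(system_prompt: str) -> str:
    """Key a cached prefix by the hash of its exact text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def strip_rotating_header(system_prompt: str) -> str:
    """Remove the per-request line so identical prefixes hash identically."""
    return BILLING_LINE.sub("", system_prompt)


class PrefixCache:
    """Toy stand-in for persistent KV state, keyed by prefix hash."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_build(self, system_prompt, build):
        key = prefix_key(system_prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # In a real engine this would be the expensive prefill step.
            self._store[key] = build(system_prompt)
        return self._store[key]


def simulate(normalize: bool, turns: int = 5):
    """Send `turns` requests whose header value changes on every request."""
    cache = PrefixCache()
    for turn in range(turns):
        prompt = f"x-billing-id: {turn:04d}\nYou are a coding assistant.\n"
        if normalize:
            prompt = strip_rotating_header(prompt)
        cache.get_or_build(prompt, build=lambda p: f"kv-state({len(p)} chars)")
    return cache.hits, cache.misses


print(simulate(normalize=False))  # (0, 5): every turn rebuilds the prefix
print(simulate(normalize=True))   # (4, 1): only the first turn rebuilds it
```

With the rotating line left in place, every request produces a new key and the prefix is rebuilt each turn; once the line is stripped, only the first turn pays the prefill cost. Whether the real fix strips, pins, or relocates the header is not stated in the summary.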
IMPACT: Optimizations like these are crucial for making self-hosted LLM deployments practical and cost-effective for developers.