MER, a new Rust-based inference engine, streams large language models such as Mixtral 8x7B from NVMe storage on inexpensive, lower-powered virtual machines. Instead of requiring a high-end GPU, it loads model experts on demand and caches frequently used ones in RAM, reaching 3.32 tokens per second (tps) on a $0.40/hour VM. In the reported benchmark the expert cache hit rate was 15.56%, and the engine is currently CPU-bound; GPU inference is planned to improve performance further.
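To make the on-demand loading and RAM caching concrete, here is a minimal sketch in Rust. It is not MER's code: the summary does not say which eviction policy or data layout MER uses, so the LRU policy, the `ExpertCache` type, the `ExpertId` key, and the fake loader closure (standing in for an NVMe read) are all assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical key for one expert: (layer index, expert index).
type ExpertId = (usize, usize);

/// Illustrative LRU cache that keeps expert weights in RAM.
/// Names and policy are assumptions, not taken from MER.
struct ExpertCache {
    capacity: usize,
    weights: HashMap<ExpertId, Vec<u8>>,
    // Front = least recently used, back = most recently used.
    recency: VecDeque<ExpertId>,
    hits: u64,
    misses: u64,
}

impl ExpertCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "cache must hold at least one expert");
        ExpertCache {
            capacity,
            weights: HashMap::new(),
            recency: VecDeque::new(),
            hits: 0,
            misses: 0,
        }
    }

    /// Return the weights for `id`. The `load` closure (a stand-in for
    /// reading the expert from NVMe) is called only on a cache miss.
    fn get_or_load<F>(&mut self, id: ExpertId, load: F) -> &[u8]
    where
        F: FnOnce(ExpertId) -> Vec<u8>,
    {
        if self.weights.contains_key(&id) {
            self.hits += 1;
            // Drop the old recency entry; it is re-added at the back below.
            self.recency.retain(|e| *e != id);
        } else {
            self.misses += 1;
            // Evict the least recently used expert when the cache is full.
            if self.weights.len() == self.capacity {
                if let Some(victim) = self.recency.pop_front() {
                    self.weights.remove(&victim);
                }
            }
            self.weights.insert(id, load(id));
        }
        self.recency.push_back(id);
        &self.weights[&id]
    }

    /// Fraction of lookups served from RAM (0.0 if there were none).
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}

/// Replay a made-up routing trace through a two-slot cache.
fn demo_hit_rate() -> f64 {
    let mut cache = ExpertCache::new(2);
    // Which expert of layer 0 each step selects (invented data).
    let trace = [0, 1, 0, 2, 1, 0];
    for &expert in trace.iter() {
        // Fake loader: four bytes filled with the expert index.
        cache.get_or_load((0, expert), |(_, e)| vec![e as u8; 4]);
    }
    cache.hit_rate()
}
```

With this invented trace and room for only two experts, one of the six lookups is served from RAM, a hit rate of about 16.7%. A low hit rate of this kind means most expert accesses still go to storage, which is why cache capacity and routing locality matter for throughput in a design like this.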
IMPACT Enables running large models on cheaper hardware, potentially lowering the barrier to entry for AI development and deployment.
RANK_REASON The article describes a new software tool, the MER inference engine, and reports its performance benchmarks.
AI-generated summary · Google Gemini · from 1 source.