PulseAugur

Rust engine streams Mixtral 8x7B on cheap VMs

MER, a new Rust-based inference engine, streams large language models such as Mixtral 8x7B from NVMe storage on cheaper, less powerful virtual machines. Instead of requiring a high-end GPU, it loads model experts on demand and caches frequently used ones in RAM, reaching 3.32 tokens per second (tps) on a $0.40/hour VM. The engine recorded a 15.56% cache hit rate and is currently CPU-bound; GPU inference is planned to improve performance further.
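The on-demand loading and RAM caching described above can be sketched roughly as follows. This is a minimal illustration, not MER's code: the `ExpertCache` type, the `get_or_load` method, and the least-recently-used eviction policy are all assumptions made for the example, since the summary does not say how MER chooses which experts to keep. The `load` closure stands in for an NVMe read.

```rust
use std::collections::HashMap;

/// Identifies one expert as (layer index, expert index). Hypothetical naming.
type ExpertId = (usize, usize);

/// A small in-RAM cache of expert weights with LRU eviction.
/// Illustrative only; the eviction policy is an assumption, not MER's.
pub struct ExpertCache {
    capacity: usize,
    clock: u64,
    // Maps an expert id to (last-used tick, weight bytes).
    entries: HashMap<ExpertId, (u64, Vec<u8>)>,
    pub hits: u64,
    pub misses: u64,
}

impl ExpertCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "cache must hold at least one expert");
        ExpertCache { capacity, clock: 0, entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Returns the expert's weights, calling `load` (a stand-in for an
    /// NVMe read) only when the expert is not already cached.
    pub fn get_or_load<F>(&mut self, id: ExpertId, load: F) -> &[u8]
    where
        F: FnOnce(ExpertId) -> Vec<u8>,
    {
        self.clock += 1;
        let now = self.clock;
        if self.entries.contains_key(&id) {
            self.hits += 1;
        } else {
            self.misses += 1;
            if self.entries.len() >= self.capacity {
                // Evict the expert with the oldest last-used tick.
                let victim = self
                    .entries
                    .iter()
                    .min_by_key(|(_, entry)| entry.0)
                    .map(|(victim_id, _)| *victim_id);
                if let Some(victim_id) = victim {
                    self.entries.remove(&victim_id);
                }
            }
            self.entries.insert(id, (now, load(id)));
        }
        let entry = self.entries.get_mut(&id).expect("present: checked or just inserted");
        entry.0 = now; // Mark as most recently used.
        &entry.1
    }

    /// Fraction of lookups served from RAM; 0.0 before any lookup.
    pub fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}
```

In this sketch, each token's router output would drive calls such as `cache.get_or_load((layer, expert), read_from_nvme)`, where `read_from_nvme` is a hypothetical loader, and `hit_rate()` yields the kind of figure the summary reports as 15.56%.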

IMPACT Enables running large models on cheaper hardware, potentially lowering the barrier to entry for AI development and deployment.

RANK_REASON The article presents MER, a new inference engine (a software tool), along with its performance benchmarks.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to, LLM tag · TIER_1 · English (EN) · Randy AP

    I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps, here's how

    Most people assume running Mixtral 8x7B requires an A100 with 80GB of VRAM. That's $2-3/hour minimum and most teams don't have access to it. I spent the last several months building M…