PulseAugur

MiniMax claims M3 delivers 9x faster prefill, 15x faster decode at 1M-token context

MiniMax has released its M3 model, which reportedly uses about a twentieth of the compute per token of its predecessor. According to the company's own figures, M3 is nine times faster in prefill and fifteen times faster in decoding at a context of one million tokens, gains attributed to a new sparse attention scheme that attends only to the relevant parts of the prompt.

IMPACT This release suggests significant efficiency gains in LLM inference, potentially lowering costs and enabling new applications with larger context windows.

RANK_REASON Frontier-lab model release with system card.

Read on Mastodon — mastodon.social →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

COVERAGE [1]

  1. Mastodon — mastodon.social TIER_1 English(EN) · datarazimedia

    MiniMax's M3 runs on about a twentieth of the compute per token of its last model. Vendor figures: 9x faster prefill and 15x faster decode at a 1M-token context, via a new sparse attention scheme that only bothers with the relevant bits of the prompt. Net effect: long-context AI …