Fireworks AI announces Step 3.7 Flash, a 196B MoE model built by StepFun for inference

By PulseAugur Editorial · [2 sources] · 2026-06-01 23:34

Fireworks AI has announced Step 3.7 Flash, a sparse Mixture-of-Experts (MoE) vision-language model designed by StepFun. The two posts quote different sizes: 196 billion parameters for the language backbone, and roughly 198 billion once the 1.8 billion parameter vision encoder is included. According to Fireworks, the model was designed for inference efficiency from the start, whereas many research labs consider inference efficiency only after the fact.

IMPACT A model designed around inference efficiency could lower serving costs for AI deployments.

RANK_REASON The cluster describes the release of a new model, but it is not from a tier-1 frontier lab and does not claim state-of-the-art performance on major benchmarks.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

X — Fireworks (inference infra) TIER_1 English(EN) · FireworksAI_HQ · 2026-06-04 03:56

Many research labs only consider inference efficiency after the fact. Step 3.7 Flash is a 198B sparse MoE VLM designed by @StepFun_ai for inference from the start. 196B language backbone with a 1.8B vision encoder. Built for real-world agent workloads, running at up to 400 https…

X — Fireworks (inference infra) TIER_1 English(EN) · FireworksAI_HQ · 2026-06-01 23:34

Many research labs only consider inference efficiency after the fact. Step 3.7 Flash is a 196B MoE model, and built for inference from the start by @StepFun_ai. Multi-Matrix Factorization Attention (MFA) → KV-cache at ~22% of DeepSeek. Attention-FFN Disaggregation (AFD) →
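
The figures quoted in the two posts can be reconciled with a little arithmetic. The sketch below is illustrative only: the function names and the 100 GB baseline are hypothetical, and the 22% ratio is the approximate figure from the post, not a measured value.

```python
# Figures quoted in the posts above (billions of parameters).
LANGUAGE_BACKBONE_B = 196.0
VISION_ENCODER_B = 1.8

# "KV-cache at ~22% of DeepSeek" is approximate; treat it as a rough ratio.
KV_CACHE_RATIO = 0.22


def total_params_b(backbone_b: float, encoder_b: float) -> float:
    """Sum the backbone and vision-encoder parameter counts (in billions)."""
    return backbone_b + encoder_b


def scaled_kv_cache_gb(baseline_gb: float, ratio: float = KV_CACHE_RATIO) -> float:
    """Scale a baseline KV-cache footprint by the quoted ratio.

    baseline_gb is a hypothetical number supplied by the caller; the posts
    do not state an absolute cache size for either model.
    """
    return baseline_gb * ratio


total = total_params_b(LANGUAGE_BACKBONE_B, VISION_ENCODER_B)
print(f"Total parameters: {total:.1f}B, which rounds to {round(total)}B")

# Hypothetical baseline of 100 GB, chosen only to make the ratio concrete.
print(f"KV-cache at the quoted ratio: {scaled_kv_cache_gb(100.0):.1f} GB")
```

This accounts for why one post says 196B and the other 198B: the smaller figure covers the language backbone alone, while the larger one adds the vision encoder.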
