DeepSeek has released a preview of its DeepSeek-V4 series of Mixture-of-Experts (MoE) language models, featuring DeepSeek-V4-Pro (1.6T parameters) and DeepSeek-V4-Flash (284B parameters). Both models support an unprecedented one-million-token context length, achieved through a hybrid attention architecture and an optimized residual connection method. Trained on over 32 trillion tokens, these models demonstrate significant efficiency gains in long-context scenarios, with DeepSeek-V4-Pro requiring substantially fewer FLOPs and less KV cache for inference than its predecessor.
IMPACT Sets new SOTA for open models in long-context reasoning and efficiency, potentially enabling new classes of AI applications.
RANK_REASON Frontier-lab model release with system card.
- Compressed Sparse Attention
- DeepSeek-V3.2
- DeepSeek-V4
- DeepSeek-V4-Flash
- DeepSeek-V4-Pro
- Heavily Compressed Attention
- Manifold-Constrained Hyper-Connections
- Mixture-of-Experts
- Muon optimizer
AI-generated summary · Google Gemini · from 1 source.