NVIDIA has released Nemotron 3 Ultra, a 550-billion-parameter open model that runs inference faster than many smaller rivals. The speed is attributed to a hybrid architecture that combines Mamba state-space layers with Transformer attention, which mitigates long-context memory bottlenecks. The model also uses a LatentMoE design with 512 experts, of which only 22 are activated per token, and incorporates Multi-Token Prediction for native speculative decoding.
AI IMPACT: This model's hybrid architecture could influence future large model designs, particularly for agentic tasks requiring long context.
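To make the sparse-activation figures concrete, here is a minimal sketch of generic top-k expert routing using the 512-expert, 22-active numbers from the summary. It is an illustration only: the summary does not describe how LatentMoE actually routes tokens, so the `route_token` helper, the softmax weighting, and the random router scores are all assumptions.

```python
import math
import random

NUM_EXPERTS = 512  # total experts, as stated in the summary
TOP_K = 22         # experts activated per token, as stated in the summary


def route_token(router_logits, k=TOP_K):
    """Hypothetical router: pick the k highest-scoring experts and
    softmax-normalize their scores into mixing weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Random scores stand in for a learned router (assumption, not real data).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

experts, weights = route_token(logits)
active_fraction = TOP_K / NUM_EXPERTS
print(f"{len(experts)} of {NUM_EXPERTS} experts active ({active_fraction:.1%})")
```

With these figures, each token touches 22 of 512 experts, or about 4.3% of them. That is a fraction of the experts, not of the 550 billion parameters, since shared layers run for every token and the summary does not give an active-parameter count.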
AI-generated summary · Google Gemini · from 1 source.