NVIDIA's Nemotron-3-Super-120B-A12B model, a hybrid Mamba and Mixture-of-Experts architecture, has demonstrated perfect needle retrieval capabilities up to 504,000 tokens. This model utilizes Mamba layers to maintain a constant recurrent state, significantly reducing the computational cost associated with long contexts compared to traditional KV cache methods. Running on four 3090 GPUs with approximately 71GB of VRAM, the model achieved impressive decode speeds at extended context lengths, outperforming comparable full-attention models. AI
IMPACT Demonstrates the potential of Mamba-based architectures for efficient long-context handling in large language models.
RANK_REASON Release of a new model architecture with benchmark results. [lever_c_demoted from research: ic=1 ai=1.0]
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →