PulseAugur

Nemotron-3-Super-120B-A12B achieves 504K token recall with Mamba+MoE architecture

NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Mamba and Mixture-of-Experts (MoE) model, has demonstrated perfect needle retrieval at context lengths up to 504,000 tokens. Its Mamba layers carry a fixed-size recurrent state rather than a KV cache that grows with every token, which significantly reduces the cost of long contexts. Running on four 3090 GPUs with approximately 71 GB of VRAM, the model reportedly decoded faster at extended context lengths than comparable full-attention models.
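To make the memory argument concrete, here is a rough back-of-the-envelope sketch of why a KV cache grows with context while a recurrent state does not. The layer counts, head counts, head dimension, and state size below are hypothetical placeholders chosen for illustration; they are not the published configuration of Nemotron-3-Super-120B-A12B.

```python
def kv_cache_bytes(n_tokens, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector
    per token, per attention layer, per KV head."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def recurrent_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Approximate recurrent (Mamba-style) state size.
    Note that the token count does not appear: the state is fixed-size."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


GIB = 1024 ** 3
CONTEXT = 504_000  # context length reported in the summary above

# Hypothetical full-attention model: 48 attention layers, 8 KV heads, head_dim 128, fp16.
full_attention = kv_cache_bytes(CONTEXT, n_attn_layers=48, n_kv_heads=8, head_dim=128)

# Hypothetical hybrid: only 8 of the layers use attention, the other 40 keep a recurrent state.
hybrid_kv = kv_cache_bytes(CONTEXT, n_attn_layers=8, n_kv_heads=8, head_dim=128)
hybrid_state = recurrent_state_bytes(n_ssm_layers=40, state_elems_per_layer=1_000_000)

print(f"full-attention KV cache: {full_attention / GIB:.1f} GiB")
print(f"hybrid KV cache + state: {(hybrid_kv + hybrid_state) / GIB:.1f} GiB")
```

Under these assumed numbers the attention-only cache scales linearly with the context length, while the recurrent part contributes the same amount at 1,000 tokens as at 504,000. Real figures depend on the actual layer mix, precision, and any cache quantization.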

IMPACT Demonstrates the potential of Mamba-based architectures for efficient long-context handling in large language models.

RANK_REASON Release of a new model architecture with benchmark results.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Important_Quote_1180

    Nemotron-3-Super-120B-A12B (hybrid Mamba+MoE) holds perfect needle retrieval to 504K tokens on 4×3090

    https://www.reddit.com/r/LocalLLaMA/comments/1ugj1sf/nemotron3super120ba12b_hybrid_mambamoe_holds/