PulseAugur

Developer creates SM1, a memory-efficient Mamba variant for PyTorch

A developer has created SM1 (Scalar Mamba1), a variant of the Mamba1 architecture with d_state=1 that is written in pure PyTorch and runs on NVIDIA Blackwell hardware. SM1 replaces the selective scan with two native PyTorch operations that compute the exact closed-form solution of the d_state=1 recurrence. The inference state is small: a 130M-parameter model requires only 56 KB, with no KV cache needed.
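
As a worked illustration of what a closed-form solution for a scalar recurrence can look like, the derivation below assumes the d_state=1 update has the first-order linear form h_t = a_t h_{t-1} + b_t. The symbols a_t (per-step decay), b_t (per-step input term), and L_t (running product) are assumptions chosen for this sketch; the summary does not spell out the author's exact formulation.

```latex
% Assumed scalar recurrence (one state value per channel):
%   h_t = a_t h_{t-1} + b_t,  t = 1, ..., T,  with initial state h_0.
\begin{align}
  L_t &= \prod_{s=1}^{t} a_s
      && \text{(cumulative product of the decays)} \\
  \frac{h_t}{L_t} &= \frac{h_{t-1}}{L_{t-1}} + \frac{b_t}{L_t}
      && \text{(divide the recurrence by } L_t = a_t L_{t-1}\text{)} \\
  h_t &= L_t \left( h_0 + \sum_{s=1}^{t} \frac{b_s}{L_s} \right)
      && \text{(telescope the sum, using } L_0 = 1\text{)}
\end{align}
```

Under these assumptions, every h_t follows from one cumulative product and one cumulative sum over the sequence, with no step-by-step loop.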

IMPACT This optimized Mamba variant could lead to more efficient training and inference for certain sequence modeling tasks.

RANK_REASON Developer created a new model variant based on an existing architecture, detailing its technical implementation and optimizations.

Read on r/MachineLearning →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

COVERAGE [1]

  1. r/MachineLearning TIER_1 English(EN) · /u/TechnoVoyager ·

    I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

    On Windows, mamba-ssm is not easily available and doesn't compile on sm_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops:

        L = torch.cumprod(dA, dim=1)
        h = L * (h0.unsqueeze(1)…
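
The quoted snippet is cut off, so the author's exact second operation is not known. As a rough, independent sketch of the general idea (a cumulative product followed by a cumulative sum), the plain-Python example below checks one possible closed form against a step-by-step loop. The function names, the input lists `a` and `b`, and the formula itself are assumptions for illustration, not the SM1 code; `itertools.accumulate` stands in for what `torch.cumprod` and `torch.cumsum` would do on tensors.

```python
from itertools import accumulate
import operator


def scan_sequential(a, b, h0):
    """Reference loop for the assumed recurrence h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_closed_form(a, b, h0):
    """Same recurrence via a cumulative product and a cumulative sum.

    Hypothetical closed form: h_t = L_t * (h0 + sum_{s<=t} b_s / L_s),
    where L_t = a_1 * ... * a_t.
    """
    # Cumulative product of the decays (the role torch.cumprod plays).
    L = list(accumulate(a, operator.mul))
    # Cumulative sum of the rescaled inputs (the role torch.cumsum would play).
    S = list(accumulate(b_t / L_t for b_t, L_t in zip(b, L)))
    return [L_t * (h0 + S_t) for L_t, S_t in zip(L, S)]


# Small made-up inputs, purely for checking that both paths agree.
a = [0.9, 0.8, 0.95, 0.7]
b = [1.0, -0.5, 0.25, 2.0]
h0 = 0.3

seq = scan_sequential(a, b, h0)
closed = scan_closed_form(a, b, h0)
max_err = max(abs(x - y) for x, y in zip(seq, closed))
print(max_err)
```

One caveat with this particular formulation: when the decays are below 1, the running product shrinks toward zero over long sequences, so dividing by it can lose precision or overflow. How SM1 deals with that is not stated in the excerpt.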