Developer creates SM1, a memory-efficient Mamba variant for PyTorch

By PulseAugur Editorial · [1 source] · 2026-05-23 05:30

A developer has created SM1 (Scalar Mamba1), a variant of the Mamba1 architecture with d_state=1 that runs in pure PyTorch, including on NVIDIA Blackwell hardware. SM1 replaces the selective scan with two native PyTorch operations that compute the exact closed-form solution of the d_state=1 recurrence. The resulting recurrent state is small: a 130M-parameter model requires only 56 KB for its inference state and needs no KV cache.
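
To make the closed-form claim concrete, here is a minimal, dependency-free sketch of a scalar linear recurrence h_t = a_t * h_{t-1} + b_t solved with one cumulative product and one cumulative sum. It is not the author's code: the second expression in the original post is truncated, so this follows the textbook closed form, and the names `scan_sequential`, `scan_closed_form`, `a`, `b`, and `h0` are hypothetical. In PyTorch, `torch.cumprod` and `torch.cumsum` along the sequence dimension would play the roles of the two `accumulate` calls.

```python
from itertools import accumulate
from operator import mul


def scan_sequential(a, b, h0):
    """Reference: step h_t = a_t * h_{t-1} + b_t one token at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_closed_form(a, b, h0):
    """Closed form: h_t = L_t * (h0 + sum_{s<=t} b_s / L_s), L_t = a_1 * ... * a_t."""
    # Cumulative product of the decay factors (the role of torch.cumprod).
    L = list(accumulate(a, mul))
    # Cumulative sum of the rescaled inputs (the role of torch.cumsum).
    S = list(accumulate(b_s / L_s for b_s, L_s in zip(b, L)))
    return [L_t * (h0 + S_t) for L_t, S_t in zip(L, S)]


# Hypothetical values for one channel: per-token decay (dA) and input (dB * x).
a = [0.9, 0.8, 0.95, 0.7]
b = [1.0, -0.5, 0.25, 2.0]
h0 = 0.3

seq = scan_sequential(a, b, h0)
closed = scan_closed_form(a, b, h0)
max_abs_diff = max(abs(x - y) for x, y in zip(seq, closed))
```

Both functions compute the same hidden states up to floating-point rounding. One caveat of this formulation: dividing by the cumulative product can lose precision or underflow on long sequences when the decay factors are well below 1, so a practical implementation may need chunking or log-space arithmetic.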

IMPACT This optimized Mamba variant could lead to more efficient training and inference for certain sequence modeling tasks.

RANK_REASON Developer created a new model variant based on an existing architecture, detailing its technical implementation and optimizations.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/MachineLearning TIER_1 English(EN) · /u/TechnoVoyager · 2026-05-23 05:30

I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

On Windows, mamba-ssm is not easily available and does not compile on sm_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops: L = torch.cumprod(dA, dim=1) and h = L * (h0.unsqueeze(1)…
