I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]
A developer has built SM1, a variant of the Mamba1 architecture written in pure PyTorch that runs on NVIDIA Blackwell hardware. SM1 replaces the selective scan with two native PyTorch operations that compute the exact closed-form solution of the d_state=1 recurrence. Memory use is low as well: a 130M-parameter model needs only 56 KB of inference state and, as with other state-space models, no KV cache.
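To make the closed-form claim concrete, here is a minimal sketch of how a scalar-state recurrence h_t = a_t * h_{t-1} + b_t can be evaluated without a sequential scan. It is not SM1's actual code: the summary does not say which two operations SM1 uses, so the choice of `torch.cumsum` applied twice, the function names `scan_sequential` and `scan_closed_form`, and the tensor shapes are all assumptions made for illustration. The sketch also assumes every decay factor a_t is strictly positive, which holds when a_t is the exponential of a real number, as in Mamba's discretization.

```python
import torch

def scan_sequential(a, b):
    # Reference loop: h_t = a_t * h_{t-1} + b_t with h_0 = 0, time on the last dim.
    h = torch.zeros_like(b[..., 0])
    out = []
    for t in range(b.shape[-1]):
        h = a[..., t] * h + b[..., t]
        out.append(h)
    return torch.stack(out, dim=-1)

def scan_closed_form(a, b):
    # Closed form (assumed formulation, not taken from SM1):
    #   h_t = P_t * sum_{s<=t} b_s / P_s,  where P_t = prod_{s<=t} a_s.
    # Requires a > 0 so that log(a) is defined.
    log_p = torch.cumsum(torch.log(a), dim=-1)          # log P_t
    return torch.exp(log_p) * torch.cumsum(b * torch.exp(-log_p), dim=-1)

torch.manual_seed(0)
# Hypothetical shapes: (batch, channels, sequence length).
a = torch.rand(2, 4, 64, dtype=torch.float64) * 0.5 + 0.5   # decay in [0.5, 1)
b = torch.randn(2, 4, 64, dtype=torch.float64)

h_ref = scan_sequential(a, b)
h_fast = scan_closed_form(a, b)
max_err = (h_ref - h_fast).abs().max().item()
```

The two results agree to floating-point tolerance on this short sequence. One caveat of this particular formulation is that `exp(-log_p)` grows as the cumulative decay shrinks, so long sequences or low-precision dtypes would likely need chunking or a log-space variant; whether and how SM1 handles that is not stated in the summary.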
IMPACT This Mamba variant could make training and inference more efficient for sequence modeling tasks where a d_state=1 recurrence is sufficient.