Which Selective State Space Model succeeds in the Visual Modality? This weekend I compared two different vision models based on the new Mamba model arch. It's a state space model, which is kind of like an RNN but also has CNN-like qualities to it. These papers came out at the same time, so I had them battle head to head.
[1] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (https://arxiv.org/pdf/2401.09417.pdf)
[2] VMamba: Visual State Space Model (https://arxiv.org/pdf/2401.10166.pdf)
The ultimate winner of the “Battle of the Mambas” was VMamba, for 3 reasons. Reason 1: the Cross-Scan-Module seemed like a better inductive bias than the Bidirectional design. The Cross-Scan-Module uses 4 different scan directions to turn the image into a sequence: left to right and top to bottom, right to left and top to bottom, left to right and bottom to top, and right to left and bottom to top (a rough sketch of the idea is below). Reason 2: VMamba had marginally better results on ImageNet, COCO, and ADE20K. Reason 3: the VMamba snake image used in the figures was better (both papers used generated snake images in their figures).
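To make the Cross-Scan idea concrete, here's a minimal sketch of how a 2D feature map could be flattened into four sequences along those directions. This is my own illustrative code, not the authors' implementation; the function name `cross_scan` and the toy example are assumptions.

```python
import numpy as np

def cross_scan(feature_map: np.ndarray) -> list[np.ndarray]:
    """Flatten an (H, W) map into 4 sequences, one per scan direction.

    Illustrative only -- not the VMamba authors' code.
    """
    return [
        feature_map.flatten(),              # left->right, top->bottom
        feature_map[:, ::-1].flatten(),     # right->left, top->bottom
        feature_map[::-1, :].flatten(),     # left->right, bottom->top
        feature_map[::-1, ::-1].flatten(),  # right->left, bottom->top
    ]

# Toy 2x2 "image" so the four orderings are easy to eyeball.
toy = np.array([[1, 2],
                [3, 4]])
for i, seq in enumerate(cross_scan(toy)):
    print(f"scan {i}: {seq}")
# scan 0: [1 2 3 4]
# scan 1: [2 1 4 3]
# scan 2: [3 4 1 2]
# scan 3: [4 3 2 1]
```

Each of the four sequences would then be fed through the selective state space block, giving every patch context from all four traversal orders rather than just one.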