Battle of the Mambas

Published Jan 24, 2024

Which Selective State Space Model succeeds in the Visual Modality? This weekend I compared two vision models built on the new Mamba architecture. It's a state space model, which is kind of like an RNN but also has CNN-like qualities (a toy sketch of the recurrence follows the paper links below). These papers came out at almost the same time, so I had them battle head to head.

[1] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (https://arxiv.org/pdf/2401.09417.pdf)

[2] VMamba: Visual State Space Model (https://arxiv.org/pdf/2401.10166.pdf)
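For anyone new to Mamba, here is roughly what that RNN-like recurrence looks like. This is a toy NumPy sketch under my own assumptions, not code from either paper: the function name, projection matrices, shapes, and the crude discretization are all illustrative. The part that makes it "selective" is that the step size and the B and C matrices are computed from the input.

```python
import numpy as np

def selective_ssm_scan(x, A, B_proj, C_proj, dt_proj):
    """Toy selective state space scan. All names and shapes are
    illustrative: x is (T, D), A is (D, N) (typically parameterized
    to be negative for stability), and B_proj/C_proj/dt_proj project
    the input so that B, C, and the step size depend on x -- the
    'selective' twist Mamba adds over earlier SSMs."""
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                       # one N-dim state per channel
    y = np.zeros((T, D))
    for t in range(T):
        dt = np.log1p(np.exp(x[t] @ dt_proj))  # softplus step size, (D,)
        B = x[t] @ B_proj                      # input-dependent B, (N,)
        C = x[t] @ C_proj                      # input-dependent C, (N,)
        A_bar = np.exp(dt[:, None] * A)        # discretized A, (D, N)
        B_bar = dt[:, None] * B[None, :]       # crude discretization of B
        h = A_bar * h + B_bar * x[t][:, None]  # the RNN-like recurrence
        y[t] = h @ C                           # readout, (T-th row of y)
    return y
```

With fixed A, B, and C the same scan can be computed as a 1-D convolution, which is where the CNN-like flavor comes from; making them input-dependent breaks that equivalence, and Mamba instead relies on a hardware-efficient parallel scan.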

The ultimate winner of the “Battle of the Mambas” was VMamba, for three reasons:

1. The Cross-Scan Module seemed like a better inductive bias than Vim's bidirectional design. It uses four different directions to turn the image into a sequence: left to right and top to bottom, right to left and top to bottom, left to right and bottom to top, and right to left and bottom to top (see the sketch after this list).
2. VMamba had marginally better results on ImageNet, COCO, and ADE20K.
3. The VMamba snake image used in the figures was better (both papers used generated snake images in their figures).
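To make Reason 1 concrete, here is a minimal sketch of the cross-scan idea under my own assumptions: unfold the patch grid along each of the four directions, hand each resulting sequence to an SSM, and merge the outputs back onto the grid. The function name, shapes, and NumPy style are illustrative, not VMamba's actual code.

```python
import numpy as np

def cross_scan(patches):
    """Unfold an (H, W, C) grid of patch embeddings into four (H*W, C)
    sequences, one per scan direction. A sketch of the idea only, not
    VMamba's implementation."""
    scans = [
        patches,              # left->right, top->bottom
        patches[:, ::-1],     # right->left, top->bottom
        patches[::-1, :],     # left->right, bottom->top
        patches[::-1, ::-1],  # right->left, bottom->top
    ]
    H, W, C = patches.shape
    # Each direction becomes its own sequence for an SSM to scan;
    # the four scan outputs are then merged back onto the 2-D grid.
    return np.stack([s.reshape(H * W, C) for s in scans])

grid = np.random.randn(14, 14, 192)  # e.g. a 14x14 grid of 192-dim patches
seqs = cross_scan(grid)              # shape (4, 196, 192)
```

Vim [1], by contrast, flattens the image once and runs forward and backward scans over that single sequence.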
