Which Selective State Space Model succeeds in the Visual Modality? This weekend I compared two different vision models based on the new Mamba model arch. It's a state space model, which is kind of like an RNN but also has CNN-like qualities to it. These papers came out at the same time, so I had them battle head to head.
[1] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (https://arxiv.org/pdf/2401.09417.pdf)
[2] VMamba: Visual State Space Model (https://arxiv.org/pdf/2401.10166.pdf)
The ultimate winner of the “Battle of the Mambas” was VMamba, for 3 reasons. Reason 1: the Cross-Scan-Module seemed like a better inductive bias than the Bidirectional design. The Cross-Scan-Module uses 4 different scan directions to turn the image into a sequence: left to right and top to bottom, right to left and top to bottom, left to right and bottom to top, and right to left and bottom to top (a rough sketch of the idea is below). Reason 2: VMamba had marginally better results on ImageNet, COCO, and ADE20K. Reason 3: the VMamba snake image used in the figures was better (both papers used generated snake images in their figures).
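To make the Cross-Scan idea concrete, here's a minimal sketch of how a 2D feature map could be flattened into four sequences along those directions. This is my own illustrative code, not the authors' implementation; the function name `cross_scan` and the toy example are assumptions.

```python
import numpy as np

def cross_scan(feature_map: np.ndarray) -> list[np.ndarray]:
    """Flatten an (H, W) map into 4 sequences, one per scan direction.

    Illustrative only -- not the VMamba authors' code.
    """
    return [
        feature_map.flatten(),              # left->right, top->bottom
        feature_map[:, ::-1].flatten(),     # right->left, top->bottom
        feature_map[::-1, :].flatten(),     # left->right, bottom->top
        feature_map[::-1, ::-1].flatten(),  # right->left, bottom->top
    ]

# Toy 2x2 "image" so the four orderings are easy to eyeball.
toy = np.array([[1, 2],
                [3, 4]])
for i, seq in enumerate(cross_scan(toy)):
    print(f"scan {i}: {seq}")
# scan 0: [1 2 3 4]
# scan 1: [2 1 4 3]
# scan 2: [3 4 1 2]
# scan 3: [4 3 2 1]
```

Each of the four sequences would then be fed through the selective state space block, giving every patch context from all four traversal orders rather than just one.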