NVIDIA-MambaVision

Abstract

  1. Main contribution: integrating Vision Transformers (ViT) with Mamba.

  2. Goal: improve the model's capacity to capture long-range spatial dependencies.

  3. Downstream tasks: object detection, instance segmentation, and semantic segmentation.

  4. Source code: GitHub - NVlabs/MambaVision: [CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Introduction

  1. Transformers are costly to train: the quadratic complexity of the attention mechanism with respect to sequence length makes Transformers computationally expensive to train and deploy.

  2. Prerequisites for this paper: ViT, Mamba, SSM, etc.

    Mamba

    Mamba introduces a new State Space Model (SSM) with linear time complexity: selective (input-dependent) processing attends to what matters, and hardware-aware considerations enable efficient parallel computation over long sequences.

    Unlike NLP (where left-to-right causal inference is largely unproblematic), CV tasks often need to infer local information from global context. For example, if image patches are flattened row by row, the top-left patch ends up far in the sequence from the patch directly below it.

    Mamba's autoregressive formulation limits its ability to capture and utilize global context in one forward pass.

    Vision Mamba (Vim) and others have proposed modifications such as bidirectional SSMs to address this lack of global context and spatial understanding.
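The row-major flattening problem above can be made concrete with a tiny index calculation (the 14x14 grid size is illustrative, e.g. a 224x224 image with 16x16 patches):

```python
# With a row-major patch ordering, the patch directly below position (r, c)
# in an H x W grid is W steps away in the flattened sequence, even though
# it is spatially adjacent.
def seq_index(r, c, num_cols):
    """Index of patch (r, c) after row-major flattening."""
    return r * num_cols + c

W = 14  # 14 x 14 patch grid
top_left = seq_index(0, 0, W)
below_top_left = seq_index(1, 0, W)
right_neighbor = seq_index(0, 1, W)

print(below_top_left - top_left)   # 14: vertically adjacent, far in sequence
print(right_neighbor - top_left)   # 1: horizontally adjacent, near in sequence
```

A causal (left-to-right) model must therefore wait many steps before it can relate vertically adjacent patches, which motivates the bidirectional-SSM fixes mentioned above.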

    Question:

    If left-to-right causal processing is largely unproblematic in NLP, why do bidirectional RNNs exist?

    Reply:

    Bidirectional RNNs serve NLU (natural language understanding), where seeing both directions helps resolve ambiguity; unidirectional RNNs serve NLG (natural language generation), e.g. translation tasks.

  3. Architecture of the proposed model:

    our proposed formulation (i.e. MambaVision Mixer and MLP) as well as Transformer blocks.

    CNNs handle the high resolutions; SSM & self-attention handle the low resolutions.

    MambaVision takes the opposite approach, with CNNs at higher resolutions and SSM/self-attention at lower ones (mentioned in the Related Work section).

    Pyramid structure: different blocks handle different resolutions; here, Mamba blocks and CNN blocks are used.

    MambaVision model which consists of a multi-resolution architecture and leverages CNN-based residual blocks for fast feature extraction of larger resolution features.

    Adding self-attention blocks at the final stage enhances the ability to capture global context and long-range spatial dependencies:

    self-attention blocks at the final stages can significantly enhance the capability to capture global context and long-range spatial dependencies.
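A rough sketch of this multi-resolution layout. The stage count, stem downsampling factor, and block labels below are illustrative assumptions consistent with the notes above (CNN blocks at high resolution, Mamba/self-attention at low resolution), not the paper's exact configuration:

```python
# Hypothetical 4-stage pyramid plan: convolutional residual blocks at the
# two high-resolution stages, MambaVision mixer + self-attention blocks at
# the two low-resolution stages.
def stage_plan(input_hw=224):
    plan = []
    hw = input_hw // 4  # assume the stem downsamples by 4
    for i, blocks in enumerate(["conv", "conv", "mamba+attn", "mamba+attn"]):
        plan.append({"stage": i + 1, "resolution": hw, "blocks": blocks})
        hw //= 2  # each stage halves the spatial resolution
    return plan

for stage in stage_plan():
    print(stage)
```

The point of the layout: attention cost grows quadratically with token count, so it is reserved for the late stages where the feature map is small, while cheap convolutions extract features at full resolution.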

  4. Evaluation metrics: image throughput, accuracy

    Image throughput: the number of images processed per unit time;

    Formula: $Throughput = \frac{Total\ Images\ Processed}{Total\ Time\ (seconds)}$
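A minimal sketch of measuring throughput per this formula. The `process_batch` callback is a hypothetical stand-in for a real model's forward pass:

```python
import time

def measure_throughput(process_batch, num_batches, batch_size):
    """Throughput = total images processed / total time (seconds)."""
    start = time.perf_counter()
    for _ in range(num_batches):
        process_batch(batch_size)
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed

# Dummy workload standing in for a model forward pass.
def fake_forward(batch_size):
    sum(i * i for i in range(10_000))

images_per_sec = measure_throughput(fake_forward, num_batches=5, batch_size=32)
```

In real benchmarks the first few batches are usually discarded as warm-up, and timing is synchronized with the GPU before reading the clock.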

Background

Stage 1-2

Gaussian Error Linear Unit activation function and batch normalization
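For reference, the GELU activation used in these stages can be written with the standard tanh approximation (a scalar sketch, not the paper's code):

```python
import math

def gelu(x):
    # Gaussian Error Linear Unit, tanh approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(round(gelu(0.0), 6))  # 0.0
```

Unlike ReLU, GELU is smooth and lets small negative inputs pass through slightly, which pairs well with batch normalization in the convolutional residual blocks.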

MambaVision mixer

The original Mamba mixer vs. the proposed MambaVision mixer

we redesigned the original Mamba mixer to make it more suitable for vision tasks

Mamba's selective scan operation; the activation σ is SiLU.

Scan is the selective scan operation as in [7] and σ is the activation function for which SiLU is used.

The SiLU function has a very simple expression:

$f(x) = x \cdot \sigma(x)$

where $\sigma(x)$ is the standard sigmoid function:

$\sigma(x) = \frac{1}{1 + e^{-x}}$
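The two formulas above translate directly into code (a scalar sketch):

```python
import math

def sigmoid(x):
    # Standard sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU (a.k.a. swish): f(x) = x * sigmoid(x)
    return x * sigmoid(x)

print(silu(0.0))  # 0.0
```

For large positive x, sigmoid(x) approaches 1, so SiLU behaves like the identity; for large negative x it approaches 0, giving a smooth, non-monotonic alternative to ReLU.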

Transformer block

Attention is computed in a windowed manner; see the two versions of Swin Transformer.

our framework allows for computing the attention in a windowed manner
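The bookkeeping behind windowed attention can be sketched as follows (a Swin-style illustration on plain nested lists, with assumed shapes, not the repository's implementation): the H x W token grid is split into non-overlapping w x w windows and attention is computed within each window, reducing cost from O((HW)^2) to O(HW * w^2).

```python
def window_partition(grid, w):
    """Split an H x W grid (list of lists) into non-overlapping w x w windows,
    scanned left-to-right, top-to-bottom. Assumes H and W are divisible by w."""
    H, W = len(grid), len(grid[0])
    windows = []
    for r0 in range(0, H, w):
        for c0 in range(0, W, w):
            windows.append([row[c0:c0 + w] for row in grid[r0:r0 + w]])
    return windows

# A 4x4 grid of token indices, partitioned into 2x2 windows.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
wins = window_partition(grid, 2)
print(len(wins))  # 4
print(wins[0])    # [[0, 1], [4, 5]]
```

Attention then runs independently inside each window; Swin Transformer additionally shifts the window grid between consecutive blocks so information can flow across window boundaries.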