Table of Contents
- 1. vit
- 2. Swin-t
- 3. vit_3D
- 4. TimeSformer First🚀🚀
- 5. vivit
1. vit
Detailed explanation
Table 1 of the paper gives the parameters of three model sizes (Base / Large / Huge); the source code additionally provides a 32x32 patch size besides the 16x16 one. Layers is the number of Encoder Blocks stacked inside the Transformer Encoder; Hidden Size is the dimension (vector length) of each token after the Embedding layer; MLP size is the number of nodes in the first fully-connected layer of the MLP Block inside the Transformer Encoder (four times the Hidden Size); Heads is the number of heads in the Transformer's Multi-Head Attention.
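As a concrete example, a ViT-Base/16 configuration can be instantiated with the vit-pytorch ViT class excerpted in section 3 (Layers = 12, Hidden Size = 768, MLP size = 3072, Heads = 12 come from Table 1; image_size = 224 and num_classes = 1000 are just illustrative choices):

import torch
from vit_pytorch import ViT

vit_base_16 = ViT(
    image_size = 224,
    patch_size = 16,
    num_classes = 1000,
    dim = 768,        # Hidden Size: token dim after the embedding layer
    depth = 12,       # Layers: number of stacked Encoder Blocks
    heads = 12,       # number of Multi-Head Attention heads
    mlp_dim = 3072    # MLP size = 4 x Hidden Size
)

img = torch.randn(1, 3, 224, 224)
logits = vit_base_16(img)   # (1, 1000)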
2. Swin-t
3. vit_3D
You need to pass two extra hyperparameters:
(1) the number of frames, frames, and (2) the patch size along the frame dimension, frame_patch_size.
import torch.nn as nn
from einops.layers.torch import Rearrange

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

class ViT3D(nn.Module):
    def __init__(self, *, image_size, image_patch_size, frames, frame_patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)        # e.g. 128, 128
        patch_height, patch_width = pair(image_patch_size)  # e.g. 16, 16
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
        assert frames % frame_patch_size == 0, 'Frames must be divisible by frame patch size'  # e.g. 16, 2

        num_patches = (image_height // patch_height) * (image_width // patch_width) * (frames // frame_patch_size)  # total number of patches: spatial patches x frame blocks
        patch_dim = channels * patch_height * patch_width * frame_patch_size  # 3 * 16 * 16 * 2 = 1536

        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (f pf) (h p1) (w p2) -> b (f h w) (p1 p2 pf c)', p1 = patch_height, p2 = patch_width, pf = frame_patch_size),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),  # project each patch to the token dim (Hidden Size), e.g. 1024
            nn.LayerNorm(dim),
        )
For comparison, the patch embedding of the original 2D ViT:

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width

        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )
The only differences are in num_patches and the patch embedding, where a few frame-related operations are added. Once the video input has been turned into tokens by the patch embedding, nothing else changes (including the later [cls] token and position embedding).
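As a quick usage sketch (this assumes the rest of the ViT3D class, i.e. the transformer, cls token and mlp_head, is implemented exactly as in the standard ViT; the hyperparameter values and num_classes = 400 are illustrative only):

import torch

model = ViT3D(
    image_size = 128,
    image_patch_size = 16,
    frames = 16,
    frame_patch_size = 2,
    num_classes = 400,   # e.g. Kinetics-400
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048
)

video = torch.randn(4, 3, 16, 128, 128)  # (batch, channels, frames, height, width)
logits = model(video)                    # (4, 400)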
4. TimeSformer First🚀🚀
The first paper to apply a pure Transformer architecture to video recognition.
A First Look at Video Transformers (1): TimeSformer, a pure-Transformer video understanding framework that drops CNNs entirely
Visualization of the five attention schemes:
Comparison of the computational complexity of self-attention and RNNs:
the cost of self-attention grows with the square of the number of input tokens; per layer it is O(n² · d) for n tokens of dimension d, versus O(n · d²) for a recurrent layer.
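TimeSformer's best-performing scheme is divided space-time attention, which ViViT's Model 3 (below) also reuses. A rough sketch of one such block is given here; it is my own simplification, assuming (b, f, n, d) inputs and ignoring the [cls] token and the feed-forward sub-block, not the official TimeSformer code:

import torch
import torch.nn as nn
from einops import rearrange

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of divided space-time attention: each block first attends over
    time (the same patch location across frames), then over space (patches
    within the same frame). Input shape: (b, f, n, d)."""
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.spatial_attn  = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        b, f, n, d = x.shape
        # temporal attention: sequences of length f, one per spatial location
        t = rearrange(x, 'b f n d -> (b n) f d')
        t = self.norm1(t)
        t, _ = self.temporal_attn(t, t, t)
        x = x + rearrange(t, '(b n) f d -> b f n d', b = b)
        # spatial attention: sequences of length n, one per frame
        s = rearrange(x, 'b f n d -> (b f) n d')
        s = self.norm2(s)
        s, _ = self.spatial_attn(s, s, s)
        x = x + rearrange(s, '(b f) n d -> b f n d', b = b)
        return x

# e.g. DividedSpaceTimeBlock(512)(torch.randn(4, 8, 64, 512)).shape -> (4, 8, 64, 512)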
5. vivit
A First Look at Video Transformers (2): ViViT, Google's open-source, more comprehensive and efficient convolution-free video classification model
To give the model more expressive power, ViViT discusses design choices along two axes (whereas TimeSformer only considered the model architecture):
Embedding video clips and Transformer Models for Video.
TimeSformer: through a reshape, the 2D spatial dimensions and the temporal dimension can each be embedded separately.
- Embedding video clips:
  How the input video data is converted into tokens (token_dim = N*d).
  Compared with "Uniform frame sampling", the second scheme (tubelet embedding) fuses spatio-temporal information already at the embedding stage.
- Transformer Models for Video:
  - Model 1: Spatio-temporal attention, which can be understood as ViT-3D and is similar to TimeSformer's joint attention.
  - Model 2: Factorised encoder
  - Model 3: Factorised self-attention
  - Model 4: Factorised dot-product attention
- Model-1
  Model 1 is just the standard ViT structure; apart from the patch embedding nothing changes, so you can refer directly to the ViT-3D code above.
- Model-2
  First Transformer: the tokens within the same frame interact with each other, producing a latent representation for each time index.
  Second Transformer: interaction across the time-step indices; temporal and spatial information are thus fused late.
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (f pf) (h p1) (w p2) -> b f (h w) (p1 p2 pf c)', p1 = patch_height, p2 = patch_width, pf = frame_patch_size),
    nn.LayerNorm(patch_dim),
    nn.Linear(patch_dim, dim),
    nn.LayerNorm(dim)
)
Compared with ViT-3D's 'b c (f pf) (h p1) (w p2) -> b (f h w) (p1 p2 pf c)', the time dimension is kept here as a separate axis instead of being flattened into the token sequence.
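A quick shape check makes the difference concrete (128x128 frames, 16x16 patches, 16 frames and frame_patch_size = 2 are the illustrative values used above):

import torch
from einops import rearrange

video = torch.randn(4, 3, 16, 128, 128)   # b c f h w

vit3d_tokens = rearrange(video, 'b c (f pf) (h p1) (w p2) -> b (f h w) (p1 p2 pf c)',
                         p1 = 16, p2 = 16, pf = 2)
print(vit3d_tokens.shape)   # torch.Size([4, 512, 1536]): one flat spatio-temporal token sequence

vivit_tokens = rearrange(video, 'b c (f pf) (h p1) (w p2) -> b f (h w) (p1 p2 pf c)',
                         p1 = 16, p2 = 16, pf = 2)
print(vivit_tokens.shape)   # torch.Size([4, 8, 64, 1536]): time kept as its own axis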
In forward, for comparison, here is the previous (ViT / ViT-3D) forward pass with its patch_embedding, cls_tokens and pos_embedding:
def forward(self, video):
    x = self.to_patch_embedding(video)                # (b, n, dim)
    b, n, _ = x.shape
    cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
    x = torch.cat((cls_tokens, x), dim=1)             # prepend the [cls] token
    x += self.pos_embedding[:, :(n + 1)]              # add the position embedding
    x = self.dropout(x)
    x = self.transformer(x)
    x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]   # mean pooling, or take the [cls] token
    x = self.to_latent(x)
    return self.mlp_head(x)
In Model-2:
if the Transformer input carries a [cls] token, that token is used for classification; otherwise the output is obtained by global average pooling.
def forward(self, video):
    x = self.to_patch_embedding(video)        # b f (h w) (p1 p2 pf c)
    b, f, n, _ = x.shape                      # n: number of patches per frame, _: token dim
    x = x + self.pos_embedding[:, :f, :n]     # position embedding is added first

    if exists(self.spatial_cls_token):
        spatial_cls_tokens = repeat(self.spatial_cls_token, '1 1 d -> b f 1 d', b = b, f = f)
        x = torch.cat((spatial_cls_tokens, x), dim = 2)   # then the spatial cls tokens are prepended

    x = self.dropout(x)
    x = rearrange(x, 'b f n d -> (b f) n d')  # fold the time dimension into the batch

    # attend across space
    x = self.spatial_transformer(x)           # spatial transformer; shape stays (b f) n d
    x = rearrange(x, '(b f) n d -> b f n d', b = b)

    # excise out the spatial cls tokens or average pool for temporal attention
    x = x[:, :, 0] if not self.global_average_pool else reduce(x, 'b f n d -> b f d', 'mean')  # e.g. [4, 8, 1024]: the first (cls) token serves as the global representation of each frame

    # append temporal CLS tokens
    if exists(self.temporal_cls_token):
        temporal_cls_tokens = repeat(self.temporal_cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((temporal_cls_tokens, x), dim = 1)

    # attend across time
    x = self.temporal_transformer(x)

    # excise out temporal cls token or average pool
    x = x[:, 0] if not self.global_average_pool else reduce(x, 'b f d -> b d', 'mean')  # e.g. [4, 1024]: the first (cls) token serves as the global representation across frames

    x = self.to_latent(x)
    return self.mlp_head(x)
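The forward above closely matches the ViViT (Model-2) implementation in lucidrains' vit-pytorch. Assuming that library, a usage sketch modeled on its README (argument names such as spatial_depth and temporal_depth come from that repo and may differ across versions):

import torch
from vit_pytorch.vivit import ViT

model = ViT(
    image_size = 128,        # frame height/width
    frames = 16,             # number of frames
    image_patch_size = 16,   # spatial patch size
    frame_patch_size = 2,    # temporal patch size
    num_classes = 1000,
    dim = 1024,
    spatial_depth = 6,       # depth of the spatial transformer
    temporal_depth = 6,      # depth of the temporal transformer
    heads = 8,
    mlp_dim = 2048
)

video = torch.randn(4, 3, 16, 128, 128)   # (batch, channels, frames, height, width)
preds = model(video)                      # (4, 1000)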
- Model-3
  Model 3 is implemented the same way as TimeSformer, just without the cls-token; see the TimeSformer article above for details.
- Model-4
  Model 4 differs from Model 1 in that the Transformer computes attention along two different dimensions: one group of heads attends over space and another over time. Code to be filled in later.
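In the meantime, a rough sketch of the idea, based on the paper's description of factorised dot-product attention rather than ViViT's reference code; the class name and the (b, f, n, d) token layout (matching Model-2 above) are my own choices:

import torch
import torch.nn as nn
from einops import rearrange

class FactorisedDotProductAttention(nn.Module):
    """Sketch of ViViT Model 4: half of the heads attend over space (tokens
    within the same frame), the other half over time (the same spatial
    location across frames). Input shape: (b, f, n, d)."""
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        assert heads % 2 == 0, 'need an even number of heads to split space/time'
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim)

    def attend(self, q, k, v):
        # standard scaled dot-product attention over the second-to-last axis
        dots = torch.einsum('... i d, ... j d -> ... i j', q, k) * self.scale
        attn = dots.softmax(dim = -1)
        return torch.einsum('... i j, ... j d -> ... i d', attn, v)

    def forward(self, x):
        b, f, n, _ = x.shape
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # split heads: (b, heads, f, n, dim_head)
        q, k, v = map(lambda t: rearrange(t, 'b f n (h d) -> b h f n d', h = self.heads), qkv)
        # first half of the heads: spatial attention (keys/values within a frame)
        qs, ks, vs = map(lambda t: t[:, :self.heads // 2], (q, k, v))
        out_s = self.attend(qs, ks, vs)                                   # (b, h/2, f, n, d)
        # second half of the heads: temporal attention (same patch across frames)
        qt, kt, vt = map(lambda t: rearrange(t[:, self.heads // 2:], 'b h f n d -> b h n f d'), (q, k, v))
        out_t = rearrange(self.attend(qt, kt, vt), 'b h n f d -> b h f n d')
        out = torch.cat((out_s, out_t), dim = 1)                          # re-merge the two head groups
        out = rearrange(out, 'b h f n d -> b f n (h d)')
        return self.to_out(out)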
Performance comparison:
Comparing the variants (Model 2's temporal transformer uses 4 layers here): Model 1 gives the best accuracy, but has the most FLOPs and the longest runtime. Model 4 adds no extra parameters and needs far less computation than Model 1, but its accuracy is lower. Model 3 requires more computation and more parameters than the other variants. Model 2 offers the best trade-off: decent accuracy at comparatively low compute and runtime. The last row is a separate ablation: removing Model 2's temporal transformer and simply pooling over frames causes a large accuracy drop on EK (Epic-Kitchens), so datasets with strong temporal structure need a temporal transformer to model temporal interaction.