After reading the paper, this post walks through the main ideas alongside the code.
Paper: https://arxiv.org/abs/2304.14394
SeqTrack casts visual tracking as a sequence generation problem and predicts the target bounding box in an autoregressive way. It drops the complex, hand-designed head networks and uses a plain encoder-decoder transformer architecture: the encoder is a ViT that extracts visual features (similar to OSTrack), while the decoder is a causal transformer that generates the sequence of bounding-box tokens autoregressively.
Image Representation
The encoder takes the template and search images as input. In existing trackers the template image is usually smaller than the search image, but SeqTrack uses the same size for both, and finds that keeping more background in the template helps tracking performance (other works argue for small templates precisely to reduce background interference, so the explanations are somewhat inconsistent). This is reflected in the config below, where TEMPLATE.SIZE and SEARCH.SIZE are both 384:
DATA:
  MAX_SAMPLE_INTERVAL: 400
  MEAN:
    - 0.485
    - 0.456
    - 0.406
  SEARCH:
    CENTER_JITTER: 3.5
    FACTOR: 4.0
    SCALE_JITTER: 0.5
    SIZE: 384
    NUMBER: 1
  STD:
    - 0.229
    - 0.224
    - 0.225
  TEMPLATE:
    CENTER_JITTER: 0
    FACTOR: 4.0
    SCALE_JITTER: 0
    SIZE: 384
    NUMBER: 2
  TRAIN:
    DATASETS_NAME:
      - GOT10K_train_full
    DATASETS_RATIO:
      - 1
    SAMPLE_PER_EPOCH: 30000
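For reference, a minimal sketch of reading such a YAML file into a yacs config node (yacs is one of the installed dependencies; the file path below is only a placeholder, the actual project has its own config loading code):

from yacs.config import CfgNode as CN
import yaml

# hedged sketch: load the YAML above and access the nested fields
with open('experiments/seqtrack/seqtrack_l384.yaml') as f:   # placeholder path
    cfg = CN(yaml.safe_load(f))
print(cfg.DATA.SEARCH.SIZE, cfg.DATA.TEMPLATE.SIZE)           # 384 384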
Sequence Representation
The bounding box is converted into a discrete sequence [x, y, w, h], with each continuous coordinate uniformly quantized to an integer in [1, n_bins]. A shared vocabulary V (size 4000) is used, and each word in V corresponds to a learnable embedding that is optimized during training, as shown in the code below:
class DecoderEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden_dim, max_position_embeddings, dropout):
        super().__init__()
        # one learnable embedding per word in the shared vocabulary V
        self.word_embeddings = nn.Embedding(vocab_size, hidden_dim)
        # a separate table of learnable position embeddings (not applied in this forward pass)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_dim)
        self.LayerNorm = torch.nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # look up the word embeddings, then normalize and apply dropout
        input_embeds = self.word_embeddings(x)
        embeddings = self.LayerNorm(input_embeds)
        embeddings = self.dropout(embeddings)
        return embeddings
Finally, a multi-layer perceptron with a softmax maps the output embeddings back to words by sampling from V.
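To make the quantization step concrete, here is a minimal sketch (not the repository code) of mapping a normalized [x, y, w, h] box to integer words and back, assuming the coordinates have already been normalized to [0, 1] relative to the search region and the tokens lie in [0, n_bins - 1] (the paper describes the range as [1, n_bins]):

import torch

def box_to_tokens(box_xywh, n_bins=4000):
    # quantize a normalized [x, y, w, h] box (values in [0, 1]) into integer words
    box = torch.clamp(box_xywh, 0.0, 1.0)
    return (box * (n_bins - 1)).round().long()

def tokens_to_box(tokens, n_bins=4000):
    # map integer words back to continuous coordinates in [0, 1]
    return tokens.float() / (n_bins - 1)

box = torch.tensor([0.48, 0.52, 0.20, 0.30])
tokens = box_to_tokens(box)        # four integer words in [0, n_bins - 1]
recovered = tokens_to_box(tokens)  # approximately the original box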
Model Architecture
The basic architecture is as follows:
SeqTrack architecture. (a) The encoder on the left is a ViT; the decoder on the right is a standard transformer decoder. The encoder extracts visual features, and the decoder uses those features to generate the bounding-box sequence autoregressively. (b) Decoder structure: the target sequence enters at the bottom, goes through self-attention first and then cross-attention with the visual features, and the output target sequence is generated autoregressively.
A causal attention mask is applied during decoding (essentially the same as in NLP: it prevents each token from peeking at the tokens after it).
Two special tokens are used: start and end. The start token tells the model to begin generating, and the end token signals that generation is complete.
During training, the decoder input sequence is [start, x, y, w, h] and the target sequence is [x, y, w, h, end] (standard teacher forcing from NLP).
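A minimal sketch of assembling these two sequences, assuming the start and end words are appended to the vocabulary after the n_bins coordinate words (their exact ids in the repository may differ):

import torch

def build_sequences(box_tokens, n_bins=4000):
    # box_tokens: LongTensor [4] holding the quantized x, y, w, h words
    start_token = torch.tensor([n_bins])     # assumed id of the 'start' word
    end_token = torch.tensor([n_bins + 1])   # assumed id of the 'end' word
    decoder_input = torch.cat([start_token, box_tokens])  # [start, x, y, w, h]
    target = torch.cat([box_tokens, end_token])           # [x, y, w, h, end]
    return decoder_input, target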
Encoder
- The cls token used for classification is removed.
- A linear projection is attached after the last layer to align the feature dimensions of the encoder and decoder:
self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)
- Only the search-image features are fed into the decoder (see the slicing sketch after the code below).
def forward_features(self, images_list):
    num_template = self.num_template
    template_list = images_list[0:num_template]
    search_list = images_list[num_template:]
    num_search = len(search_list)

    # template images: patch embedding + template position embeddings
    z_list = []
    for i in range(num_template):
        z = template_list[i]
        z = self.patch_embed(z)
        z = z + self.pos_embed[:, self.num_patches_search:, :]
        z_list.append(z)
    z_feat = torch.cat(z_list, dim=1)

    # search images: patch embedding + search position embeddings
    x_list = []
    for i in range(num_search):
        x = search_list[i]
        x = self.patch_embed(x)
        x = x + self.pos_embed[:, :self.num_patches_search, :]
        x_list.append(x)
    x_feat = torch.cat(x_list, dim=1)

    # concatenate search tokens first, then template tokens
    xz_feat = torch.cat([x_feat, z_feat], dim=1)
    xz = self.pos_drop(xz_feat)

    for blk in self.blocks:  # batch is the first dimension
        if self.use_checkpoint:
            xz = checkpoint.checkpoint(blk, xz)
        else:
            xz = blk(xz)
    xz = self.norm(xz)  # B, N, C
    return xz
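Because the search tokens come first in the concatenation, keeping only the search-image features for the decoder reduces to a simple slice. A hedged sketch (the actual selection code in the repository may differ):

def select_search_features(xz, num_search_tokens):
    # xz: [B, N_search + N_template, C], output of forward_features (search tokens first)
    return xz[:, :num_search_tokens, :]   # only these features are passed to the decoder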
Decoder
- It receives the word embeddings from the previous block and uses a causal mask to ensure that the output at each sequence element depends only on the elements before it.
The causal mask is generated as follows:
def generate_square_subsequent_mask(sz):
    r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
    Unmasked positions are filled with float(0.0).
    """
    # each token can only see the tokens before it (and itself)
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
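For the five-element training sequence [start, x, y, w, h], the resulting mask has zeros on and below the diagonal (visible positions) and -inf above it (masked future positions):

>>> generate_square_subsequent_mask(5)
tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])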
Training and Inference
Training
Loss function: a cross-entropy loss that maximizes the log-likelihood of the target tokens conditioned on the preceding subsequence and the input video frames.
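A minimal sketch of this objective, assuming the vocab_embed head produces logits of shape [B, 5, vocab_size] and the target sequence holds the [x, y, w, h, end] word ids:

import torch.nn.functional as F

def seq_ce_loss(logits, target_seq):
    # logits: [B, L, vocab_size] scores over the word vocabulary
    # target_seq: [B, L] ground-truth word ids ([x, y, w, h, end])
    return F.cross_entropy(logits.flatten(0, 1), target_seq.flatten())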
Inference
Online template update and a window penalty are introduced to inject prior knowledge during inference, which further improves accuracy and robustness. The likelihood of the generated tokens is used to automatically decide whether a dynamic template is reliable. A new window-penalty strategy is also introduced: the discrete coordinate of the center of the current search region is [n_bins / 2, n_bins / 2], i.e. the target center position from the previous frame. When generating x and y, the likelihood of each integer (word) in V is penalized according to its distance from n_bins / 2: the larger the difference, the larger the penalty.
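A hedged sketch of this idea (not the repository implementation), applying a Hanning-style window that peaks near n_bins / 2 to the softmax scores of the x (or y) words:

import torch

def apply_window_penalty(scores, n_bins=4000):
    # scores: [n_bins] softmax likelihoods of the coordinate words for x (or y)
    # words far from n_bins / 2 (the previous target center) are down-weighted
    window = torch.hann_window(n_bins, periodic=False)  # peaks near the middle bin
    return scores * window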
Experiments
Create and activate the seqtrack conda environment:
conda create -n seqtrack python=3.8
conda activate seqtrack
The required packages (install.sh) are as follows:
echo "****************** Installing pytorch ******************"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
#conda install -y pytorch=1.11 torchvision torchaudio cudatoolkit=11.3 -c pytorch
echo ""
echo ""
echo "****************** Installing yaml ******************"
pip install PyYAML
echo ""
echo ""
echo "****************** Installing easydict ******************"
pip install easydict
echo ""
echo ""
echo "****************** Installing cython ******************"
pip install cython
echo ""
echo ""
echo "****************** Installing opencv-python ******************"
pip install opencv-python
echo ""
echo ""
echo "****************** Installing pandas ******************"
pip install pandas
echo ""
echo ""
echo "****************** Installing tqdm ******************"
conda install -y tqdm
echo ""
echo ""
echo "****************** Installing coco toolkit ******************"
pip install pycocotools
echo ""
echo ""
echo "****************** Installing jpeg4py python wrapper ******************"
pip install jpeg4py
echo ""
echo ""
echo "****************** Installing tensorboard ******************"
pip install tb-nightly
echo ""
echo ""
echo "****************** Installing tikzplotlib ******************"
pip install tikzplotlib
echo ""
echo ""
echo "****************** Installing thop tool for FLOPs and Params computing ******************"
pip install --upgrade git+https://github.com/Lyken17/pytorch-OpCounter.git
echo ""
echo ""
echo "****************** Installing colorama ******************"
pip install colorama
echo ""
echo ""
echo "****************** Installing lmdb ******************"
pip install lmdb
echo ""
echo ""
echo "****************** Installing scipy ******************"
pip install scipy
echo ""
echo ""
echo "****************** Installing visdom ******************"
pip install visdom
echo ""
echo ""
echo "****************** Installing vot-toolkit python ******************"
pip install git+https://github.com/votchallenge/vot-toolkit-python
echo ""
echo ""
echo "****************** Installing timm ******************"
pip install timm==0.5.4
echo ""
echo ""
echo "****************** Installing yacs ******************"
pip install yacs
echo ""
echo ""
echo "****************** Installation complete! ******************"
Run the following command to install them:
bash install.sh
Add the project path to the environment variables:
export PYTHONPATH=<absolute_path_of_SeqTrack>:$PYTHONPATH
The tracking data should be organized as follows:
${SeqTrack_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- images
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST
Run the following command to set up the project paths:
python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .
Train SeqTrack:
python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training.py --script seqtrack --config seqtrack_b256 --save_dir .
Testing and evaluation on the benchmarks is not fully finished yet; I will keep updating this part.
Code
The SEQTRACK model class (taking SeqTrack-L256 as an example):
class SEQTRACK(nn.Module):
    """ This is the base class for SeqTrack """
    def __init__(self, encoder, decoder, hidden_dim,
                 bins=1000, feature_type='x', num_frames=1, num_template=1):
        """ Initializes the model.
        Parameters:
            encoder: torch module of the encoder to be used. See encoder.py
            decoder: torch module of the decoder architecture. See decoder.py
        """
        super().__init__()
        self.encoder = encoder
        self.num_patch_x = self.encoder.body.num_patches_search
        self.num_patch_z = self.encoder.body.num_patches_template
        self.side_fx = int(math.sqrt(self.num_patch_x))
        self.side_fz = int(math.sqrt(self.num_patch_z))
        self.hidden_dim = hidden_dim
        # the bottleneck layer, which aligns the dimension of encoder and decoder
        self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)
        self.decoder = decoder
        # maps decoder embeddings to word scores; bins + 2 covers the coordinate words plus the start/end tokens
        self.vocab_embed = MLP(hidden_dim, hidden_dim, bins + 2, 3)
        self.num_frames = num_frames
        self.num_template = num_template
        self.feature_type = feature_type

        # Different types of visual features for the decoder.
        # Since we only use one search image for now, 'x' is the same as 'x_last' here.
        if self.feature_type == 'x':
            num_patches = self.num_patch_x * self.num_frames
        elif self.feature_type == 'xz':
            num_patches = self.num_patch_x * self.num_frames + self.num_patch_z * self.num_template
        elif self.feature_type == 'token':
            num_patches = 1
        else:
            raise ValueError('illegal feature type')

        # position embedding for the decoder memory, initialized with a sinusoidal table
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))
        pos_embed = get_sinusoid_encoding_table(num_patches, self.pos_embed.shape[-1], cls_token=False)
        self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
encoder (ViT):
@register_model
def vit_large_patch16(pretrained=False, pretrain_type='default',
                      search_size=384, template_size=192, **kwargs):
    patch_size = 16
    model = VisionTransformer(
        search_size=search_size, template_size=template_size,
        patch_size=patch_size, num_classes=0,
        embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    cfg_type = 'vit_large_patch16_224_' + pretrain_type
    if pretrain_type == 'scratch':
        # training from scratch: skip loading any pretrained weights
        pretrained = False
        return model
    model.default_cfg = default_cfgs[cfg_type]
    if pretrained:
        load_pretrained(model, pretrain_type, num_classes=model.num_classes,
                        in_chans=kwargs.get('in_chans', 3))
    return model
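A small usage sketch, following the 'scratch' branch above (so no pretrained weights need to be resolved); the argument values are only for illustration:

# build a ViT-Large encoder without loading pretrained weights
encoder = vit_large_patch16(pretrained=False, pretrain_type='scratch',
                            search_size=384, template_size=384)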
decoder (DETR-style):
class SeqTrackDecoder(nn.Module):
    def __init__(self, d_model=512, nhead=8,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False, bins=1000, num_frames=9):
        super().__init__()
        self.bins = bins
        self.num_frames = num_frames
        self.num_coordinates = 4  # [x, y, w, h]
        # (4 + 1) positions per frame: four coordinate words plus one special token
        max_position_embeddings = (self.num_coordinates + 1) * num_frames
        self.embedding = DecoderEmbeddings(bins + 2, d_model, max_position_embeddings, dropout)
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.body = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                       return_intermediate=return_intermediate_dec)
        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead
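Finally, to connect the pieces, a hedged sketch of the autoregressive inference loop: starting from the start word, the decoder is run step by step, each time appending the most likely next word until x, y, w, h have been produced. The names model, enc_feat, start_id and the decoder call signature are assumptions for illustration and do not mirror the repository code exactly:

import torch

@torch.no_grad()
def generate_box(model, enc_feat, start_id, n_steps=4):
    # enc_feat: [B, N, C] encoder memory (search-image features)
    seq = torch.full((enc_feat.size(0), 1), start_id,
                     dtype=torch.long, device=enc_feat.device)  # [B, 1] = [start]
    for _ in range(n_steps):
        out = model.decoder(enc_feat, seq)               # assumed: (memory, prefix) -> [B, L, C]
        logits = model.vocab_embed(out[:, -1, :])        # scores for the next word
        next_word = logits.argmax(dim=-1, keepdim=True)  # greedy choice
        seq = torch.cat([seq, next_word], dim=1)
    return seq[:, 1:]                                    # the four [x, y, w, h] words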