YOLOV++ Explained | Network Structure, Code Walkthrough, YOLOV Paper Reading, and a First Look at VID

Preface

Code: https://github.com/YuHengsss/YOLOV

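The network diagrams in this post are drawn for YOLOV++ SwinTiny. The model variants differ mainly in the Backbone; the VID-related parts are essentially the same.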

Predict

Input

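The walkthrough is based on vid_demo. It first reads every frame of the video (use a short video, or modify the code to cap the number of frames, otherwise memory runs out), then randomly shuffles the frame order and splits the frames into groups of gframe=32, each group forming one batch.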
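
As a rough illustration of the grouping described above, here is a minimal sketch. The function name make_batches and the handling of a shorter final group are my own assumptions, not taken from vid_demo.

```python
import random

def make_batches(num_frames, gframe=32, seed=None):
    """Shuffle frame indices and split them into groups of `gframe` (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(num_frames))
    rng.shuffle(indices)
    # The last group may be shorter; how vid_demo treats the remainder is not covered here.
    return [indices[i:i + gframe] for i in range(0, num_frames, gframe)]

batches = make_batches(num_frames=100, gframe=32, seed=0)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Every frame lands in exactly one batch, and neighbouring indices inside a batch are generally not neighbouring frames in the video.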

Backbone

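The overall structure is similar to the YOLO series. The feature extractor is a Swin Transformer; since it is Transformer-based and its computation is fairly involved, this post skips its details. The C3 module is computed as follows: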

Head

1. Conventional detection


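Each of the three feature scales first passes through a $1\times 1$ convolution that unifies the channel dimension to 256, and then feeds several branches (3 in the figure).
Two of them are the conventional detection branches: the regression branch reg predicts the box and the objectness, and the classification branch cls predicts the class probabilities.
Two further branches can provide the classification and regression features needed by the later VID stage (each can be enabled optionally in the code). Here only the classification-feature branch is enabled, giving vid_cls_feat, while the regression features are taken directly from the intermediate features of the regression branch, i.e. vid_reg_feat = reg_feat.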

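The detection output Output of the conventional branches goes through Decode and PostProcess to produce the detection results.
(1) Decode: the first two box entries xy are added to the Anchor coordinates to give the box centre, and the last two entries wh are passed through exp() to give the box width and height.
(2) PostProcess: first convert the box format xywh -> xyxy. Then, for each image in the batch, take the highest-scoring class of every Anchor: $35(xyxy+\mathrm{obj}+30) \to 37(xyxy+\mathrm{obj}+\mathrm{conf} +\mathrm{cls}+30)$. The detections are thresholded on obj * conf, and a minimum number of detections minimal_limit=50 is enforced: when too few detections pass the threshold, the Top 50 by obj * conf are kept instead. This yields the [32,n,37] detection results in the figure, where n, the number of detections, differs from image to image. NMS can optionally be enabled here.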
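
The two steps can be sketched in plain Python as below. This is only an illustration under assumptions: the stride factor follows the usual YOLOX-style decoding and is not mentioned in the text above, the `>=` comparison and the threshold value are placeholders, and the function names are hypothetical.

```python
import math

def decode(raw_box, grid_xy, stride=1.0):
    """raw_box = (dx, dy, dw, dh); grid_xy = Anchor (grid cell) coordinates.
    The stride multiplication is an assumption borrowed from YOLOX-style heads."""
    dx, dy, dw, dh = raw_box
    cx = (dx + grid_xy[0]) * stride
    cy = (dy + grid_xy[1]) * stride
    return cx, cy, math.exp(dw) * stride, math.exp(dh) * stride

def xywh_to_xyxy(box):
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def select(scores, thresh, minimal_limit=50):
    """Indices whose obj * conf passes `thresh`; if fewer than `minimal_limit`
    pass, fall back to the top `minimal_limit` by score."""
    keep = [i for i, s in enumerate(scores) if s >= thresh]
    if len(keep) < minimal_limit:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = order[:minimal_limit]
    return keep

box = decode((0.5, 0.5, 0.0, 0.0), grid_xy=(3, 4), stride=8)
print(box)                # (28.0, 36.0, 8.0, 8.0)
print(xywh_to_xyxy(box))  # (24.0, 32.0, 32.0, 40.0)
print(select([0.9, 0.1, 0.5, 0.05], thresh=0.3, minimal_limit=3))  # [0, 2, 1]
```

In the last call only two scores pass the threshold, so the fallback returns the three best indices instead.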

2. VID

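Using the detections obtained above, together with the classification features vid_cls_feat and the regression features vid_reg_feat, the features at the Anchor positions of the detections are extracted as feat_cls and feat_reg. The two kinds of features are processed in almost the same way, so the classification features are used as the example.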
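Notes:
(1) $N=\sum n$ is the total number of detections in the current batch.
(2) Classification and regression differ in one place: the regression attention is not multiplied by reg_score at the scale step. reg_score is the objectness score obj, and cls_score is the highest class score conf.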

attn_cls = (q_cls @ k_cls.transpose(-2, -1)) * self.scale * cls_score * cls_score_mask
attn_reg = (q_reg @ k_reg.transpose(-2, -1)) * self.scale# * fg_score * fg_score_mask

(3) The figure omits the normalisation of q, k and v. Note that attn @ v uses the original v, before normalisation.

q_cls = q_cls / torch.norm(q_cls, dim=-1, keepdim=True)
k_cls = k_cls / torch.norm(k_cls, dim=-1, keepdim=True)
q_reg = q_reg / torch.norm(q_reg, dim=-1, keepdim=True)
k_reg = k_reg / torch.norm(k_reg, dim=-1, keepdim=True)
v_cls_normed = v_cls / torch.norm(v_cls, dim=-1, keepdim=True)
v_reg_normed = v_reg / torch.norm(v_reg, dim=-1, keepdim=True)
x_cls = (attn @ v_cls).transpose(1, 2).reshape(B, N, C)
x_reg = (attn @ v_reg).transpose(1, 2).reshape(B, N, C)
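
To make the flow of notes (2) and (3) concrete, here is a single-head, batch-free sketch in plain Python. It is an illustration, not the repository code: I assume the confidence multiplies the logits per key (column), that a row-wise softmax follows, and the scale value is arbitrary. Zero-norm vectors are not handled.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_attention(q, k, v, key_scores, scale=1.0):
    """q, k, v: lists of N vectors; key_scores: one confidence per key (assumed layout)."""
    qn = [l2_normalize(x) for x in q]
    kn = [l2_normalize(x) for x in k]
    attn = []
    for qi in qn:
        # Cosine similarity between normalised q and k, scaled and weighted by the key's score.
        logits = [scale * s * sum(a * b for a, b in zip(qi, kj))
                  for kj, s in zip(kn, key_scores)]
        attn.append(softmax(logits))
    # Fuse the raw (un-normalised) v, as noted in (3).
    dim = len(v[0])
    out = [[sum(attn[i][j] * v[j][d] for j in range(len(v))) for d in range(dim)]
           for i in range(len(q))]
    return attn, out

feats = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
attn, out = cosine_attention(feats, feats, feats, key_scores=[0.9, 0.8, 0.1], scale=25.0)
print([round(sum(row), 6) for row in attn])  # [1.0, 1.0, 1.0]
```

With the low score 0.1 on the third key, the other objects barely attend to it even where the cosine similarity is not small.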

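(4) v_normed is used to compute a similarity matrix. The computation, which the code below implements, goes as follows:

Matrix multiplication gives the similarity matrix attn_x_raw.
The similarities are averaged over the heads (still attn_x_raw).
Thresholding gives the positions with high similarity, x_mask (sim_mask and obj_mask in the code).
attn, the mean of attn_cls and attn_reg in the figure above, is likewise averaged over the heads and then passed through softmax; call the result attn_new (sim_round2 in the code).
x_mask is multiplied by attn_new and divided by the row sum, so each row is still a probability distribution summing to 1. Intuitively, only positions whose v vectors are similar enough keep their attn_new value; the resulting mask serves as the weights for a new round of feature fusion.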

attn_cls_raw = v_cls_normed @ v_cls_normed.transpose(-2, -1)
attn_reg_raw = v_reg_normed @ v_reg_normed.transpose(-2, -1)
attn_cls_raw = torch.sum(attn_cls_raw, dim=1, keepdim=False)[0] / self.num_heads
attn_reg_raw = torch.sum(attn_reg_raw, dim=1, keepdim=False)[0] / self.num_heads
# sim_thresh=0.75
sim_mask = torch.where(attn_cls_raw > sim_thresh, ones_matrix, zero_matrix)
# remove ave and conf guide in the reg branch, modified in 2023.12.5
# conf_sim_thresh=0.99
obj_mask = torch.where(attn_reg_raw > conf_sim_thresh, ones_matrix, zero_matrix)

sim_attn = torch.sum(attn, dim=1, keepdim=False)[0] / self.num_heads
sim_round2 = torch.softmax(sim_attn, dim=-1)

ave_mask = sim_mask * sim_round2 / (torch.sum(sim_mask * sim_round2, dim=-1, keepdim=True))
obj_mask = obj_mask * sim_round2 / (torch.sum(obj_mask * sim_round2, dim=-1, keepdim=True))
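
The masking and renormalisation in the last two code lines can be reproduced on a toy example. The sketch below uses nested lists instead of tensors; masked_weights is a hypothetical name and the numbers are made up, only the threshold 0.75 comes from the code above.

```python
def masked_weights(sim, attn_soft, sim_thresh=0.75):
    """sim: N x N cosine similarities of v; attn_soft: N x N softmaxed attention.
    Keeps attention only where similarity exceeds the threshold, then renormalises each row."""
    out = []
    for sim_row, attn_row in zip(sim, attn_soft):
        kept = [a if s > sim_thresh else 0.0 for s, a in zip(sim_row, attn_row)]
        total = sum(kept)  # > 0 because the diagonal similarity is 1
        out.append([x / total for x in kept])
    return out

sim = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.5],
       [0.2, 0.5, 1.0]]
attn_soft = [[0.5, 0.3, 0.2],
             [0.2, 0.6, 0.2],
             [0.1, 0.1, 0.8]]
weights = masked_weights(sim, attn_soft)
print([[round(w, 3) for w in row] for row in weights])
```

Object 3 is not similar enough to either of the others, so its row collapses onto itself, while objects 1 and 2 share weight only with each other.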

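(5) The masks and the output features then go through the following operations to give the final classification and regression features.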
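My understanding: torch.norm computes the L2 norm of a vector, so attn_x_raw is the cosine similarity between the v vectors of every pair of objects. Attention itself uses attn (from q and k) to guide the fusion of v; here a thresholded similarity matrix is used to guide attn in turn. Intuitively, the aim is to filter out the mutual influence between objects whose similarity is low: attn can be read as the weight of every object with respect to the current one, and the mask forces the weights of low-similarity objects to 0. Both the Attention part and the Mask part contain a Concat operation, so an object's feature vector can be split into four parts: v, attn + v, (mask + attn_new) + v, and (mask + attn_new) + (attn + v). The final feature can thus be seen as the original feature plus features fused in three different ways.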

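(6) The code sets decouple_reg = True: the computation above keeps only the classification features, and the regression features are recomputed from scratch in a separate pass.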

features_cls, features_reg = self.agg(...)
_, features_reg = self.agg_iou()

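My understanding: fusing the classification and regression features separately works better, and the point where they influence each other is attn = (attn_reg + attn_cls) / 2. Intuitively, guiding feature fusion with both the classification and the regression attn seems like a good choice, but when fusing classification features one probably wants attn_cls to weigh more than attn_reg, and vice versa. Setting the weights by hand, e.g. attn_cls_new = 0.7 * attn_cls + 0.3 * attn_reg, may be too rigid (and this notion of strength is only a guess), so the code simply adds a branch and trains the Attention for the two kinds of features separately.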

(7) The resulting features pass through fully connected layers to output new cls and obj results.
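(8) Post-processing

Both cls_preds and obj_preds go through sigmoid and replace the corresponding entries of the original output.
Keep results with class score $\ge 0.001$ (if several class scores of the same Anchor exceed the threshold, all of them are kept).
Keep results with $\mathrm{cls\ score}\times\mathrm{obj\ conf} \ge 0.001$.
Apply NMS.
If the number of frames containing objects in the VID results is $\le 4$, fall back to the original single-frame detection results.
When drawing the results, a further threshold is applied: $\mathrm{cls\ score}\times\mathrm{obj\ conf} \ge 0.05$.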
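
The two filtering steps and the fallback rule can be sketched as follows. The function names are hypothetical, NMS is left out, and treating the fallback as a decision over the whole batch is my reading of the rule rather than something confirmed here.

```python
def expand_and_filter(cls_scores, obj_scores, thresh=0.001):
    """cls_scores: per-Anchor list of class scores (after sigmoid);
    obj_scores: per-Anchor objectness (after sigmoid).
    One Anchor may yield several (anchor, class, score) entries."""
    kept = []
    for a, (scores, obj) in enumerate(zip(cls_scores, obj_scores)):
        for c, s in enumerate(scores):
            if s >= thresh and s * obj >= thresh:
                kept.append((a, c, s * obj))
    return kept

def choose_results(vid_results, single_frame_results, min_frames=4):
    """Fall back to single-frame detections when too few frames contain objects."""
    frames_with_objects = sum(1 for dets in vid_results if dets)
    if frames_with_objects <= min_frames:
        return single_frame_results
    return vid_results

kept = expand_and_filter([[0.9, 0.4], [0.0005, 0.0002]], [0.8, 0.9])
print([(a, c) for a, c, _ in kept])  # [(0, 0), (0, 1)]
```

The first Anchor contributes two entries because both of its class scores pass, while the second Anchor is dropped entirely.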

Train

Data

The training data consists of many image sequences. train_seq.npy holds a list of such sequences, with contents like the following.

['ILSVRC2015/Data/VID/train/ILSVRC2015_VID_train_0002/ILSVRC2015_train_00575001/000000.JPEG', 
 ...,
 'ILSVRC2015/Data/VID/train/ILSVRC2015_VID_train_0002/ILSVRC2015_train_00575001/000539.JPEG']
...

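Start with the default training parameters lframe = 0, gframe = 16: the frames of a sequence are randomly shuffled and split into groups of gframe images, each group forming one batch. In short, all the data in a training batch comes from a single video, and the frame order is random.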

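When lframe != 0, all frames are first split, in order, into n groups of lframe images. Each group is then visited in turn: the images of the current group are denoted lf, and gframe images drawn at random from all the remaining groups are denoted gf; lf+gf forms one training batch. In short, a batch is some consecutive frames plus some random frames.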
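
Both sampling modes can be sketched with the standard library. This is my own reconstruction of the description, not the repository's sampler: sample_batches is a hypothetical name, and details such as dropping incomplete groups are not modelled.

```python
import random

def sample_batches(frames, lframe, gframe, seed=None):
    """Build training batches from one video's frame list (illustrative only)."""
    rng = random.Random(seed)
    if lframe == 0:
        # Global mode: shuffle everything, then cut into groups of gframe.
        shuffled = frames[:]
        rng.shuffle(shuffled)
        return [shuffled[i:i + gframe] for i in range(0, len(shuffled), gframe)]
    # Local + global mode: consecutive lframe frames plus gframe random frames
    # drawn from all the other groups.
    groups = [frames[i:i + lframe] for i in range(0, len(frames), lframe)]
    batches = []
    for idx, lf in enumerate(groups):
        rest = [f for j, g in enumerate(groups) if j != idx for f in g]
        gf = rng.sample(rest, min(gframe, len(rest)))
        batches.append(lf + gf)
    return batches

frames = list(range(20))
batches = sample_batches(frames, lframe=4, gframe=6, seed=0)
print(len(batches), len(batches[0]), batches[1][:4])  # 5 10 [4, 5, 6, 7]
```

The first lframe entries of each batch are consecutive frames; the remaining gframe entries never overlap with them.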

Loss

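The loss differs little from the usual detection loss; roughly,
$\mathrm{loss}=w\cdot\mathrm{iou}+\mathrm{obj}+\mathrm{cls}+\mathrm{vid\_obj}+\mathrm{vid\_cls}$
where $w = 3$ is the weight of the $\mathrm{iou}$ term, and every term except $\mathrm{iou}$ is computed with nn.BCEWithLogitsLoss.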
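
For reference, here is a scalar sketch of the weighted sum, with a hand-written binary cross-entropy on logits standing in for nn.BCEWithLogitsLoss. How the real code reduces and normalises each term (sum, mean, division by the number of foreground samples) is not covered here, and the function names are my own.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit:
    max(x, 0) - x * t + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def total_loss(iou, obj, cls, vid_obj, vid_cls, w=3.0):
    """Weighted sum from the formula above; only the iou term is weighted."""
    return w * iou + obj + cls + vid_obj + vid_cls

print(round(bce_with_logits(0.0, 1.0), 4))   # 0.6931, i.e. log(2)
print(total_loss(1.0, 0.5, 0.5, 0.5, 0.5))   # 5.0
```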

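Summary & Thoughts

1. Input

My impression is that the method is only effective on videos like those in the ILSVRC2015_VID dataset. I looked at a few of them, and they share these traits: the videos are short, the scene or background changes little, and most are shot to follow one or more objects, which are usually fairly large.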

2. VID

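The VID-specific part really only exists in a small portion of the Head. Throughout the Backbone the model is no different from a conventional detector, and all features are computed from a single frame.
Put simply, the VID part takes the features belonging to the detections of all frames, fuses them under the guidance of Attention and similarity, and makes new predictions from the fused features. The new predictions are limited to obj and cls, however: intuitively, the class and the objectness are corrected using information from other frames, while the box position is not changed at all.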

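My feeling is that the main problem YOLOV++ solves is this: an object may be detected as the wrong class in a single frame because of occlusion, viewpoint, pose, blur and so on, while it can be detected clearly in other frames, so fusing features corrects the judgement for the current frame. This clearly depends on obj * conf not being too low, though: if the single-frame confidence is low, the object never even gets the chance to be corrected.
Feature fusion over temporal data is genuinely hard. YOLOV++ appears to favour detection speed: compared with single-frame detection the added computation is small, but it scales with the number of objects N, so the more objects there are, the more the feature fusion costs and the slower it runs. (The code has a maximum-count parameter that is the counterpart of minimal_limit=50, which lets you trade accuracy for speed.)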

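3. NMS on the initial candidates

For convenience, the single-frame detections are called initial candidates here, and the results after VID final objects. NMS is disabled by default when selecting the initial candidates; I ran an evaluation on YOLOV++ SwinTiny to compare.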

mAP	NMS disabled	NMS enabled
total	0.8656	0.8509
slow	0.9071	0.8979
medium	0.8582	0.8385
fast	0.7407	0.7168

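I will not go into how the evaluation is computed; the results show that the default, with NMS disabled for the initial candidates, works better. As a rule of thumb, several Anchors usually point to the same object with slightly different boxes and confidences, and NMS then keeps one of them for that object. In the VID stage, judging by feature similarity alone, Anchors in the same frame that point to the same object may well be more similar to each other than Anchors in different frames pointing to that object. Yet the minimum-count limit and the disabled NMS in the code suggest that keeping enough objects for fusion takes priority.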
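
For readers less familiar with the step being switched on and off here, this is a minimal greedy NMS in plain Python. It is a generic, class-agnostic sketch with an arbitrary IoU threshold, not the routine used in the repository.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119, about 0.68, so it is suppressed. With NMS disabled, both near-duplicate Anchors would enter the fusion stage.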

Paper

The YOLOV++ paper is not out yet, so for now I read the YOLOV paper.

Related Work

Object Detection in Still Images

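This is single-frame detection. Skipping the individual methods, the point is how one-stage and two-stage detectors affect VID. A two-stage detector first produces features for candidate regions via RPN + ROI; a one-stage detector lacks such features, and they are very handy for feature aggregation in VID. YOLOV explores whether aggregation can be done with one-stage features.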

One-stage detectors are usually faster than two-stage ones, owing to the end-to-end manner. However, they lack explicit region-level semantic features that are widely used for feature aggregation in video object detection.

Object Detection in Videos

The paper roughly divides existing VID methods into two branches.

One branch of existing video object detectors concentrate on tracklet-level post-processing. The methods in this category try to refine the prediction results from the still image detector in consecutive frames by forming the object tubelets. The final classification score of each box is adjusted according to the entire tubelet.
2016 - "Seq-NMS for video object detection"
2019 - VISIGRAPP - "Improving Video Object Detection by Seq-Bbox Matching"
2020 - IROS - "Robust and efficient post-processing for video object detection"

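The first branch is not described in much detail. It appears to work purely in post-processing, adjusting each box's classification score by combining the detections over consecutive frames. The second branch below is the focus.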
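
To give a feel for what "adjusting the score according to the entire tubelet" can mean, here is a deliberately generic sketch. It is not an implementation of any of the three cited methods; it assumes the boxes have already been linked into a tubelet and simply replaces each score with the tubelet's average or maximum.

```python
def rescore_tubelet(scores, mode="avg"):
    """Replace every per-frame score in a tubelet by one shared value (illustrative only)."""
    if mode == "avg":
        new_score = sum(scores) / len(scores)
    elif mode == "max":
        new_score = max(scores)
    else:
        raise ValueError("mode must be 'avg' or 'max'")
    return [new_score] * len(scores)

tubelet_scores = [0.9, 0.85, 0.3, 0.88]  # the third frame is, say, motion-blurred
print(rescore_tubelet(tubelet_scores, "max"))  # [0.9, 0.9, 0.9, 0.9]
```

The weak third frame inherits the confidence of its neighbours, which is the kind of correction this family of methods aims for.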

Another branch aims to enhance the features of keyframes, expecting to alleviate degradation via utilizing the features from (selected) reference frames. These approaches can be roughly classified as optical flow-based, attention-based and tracking-based methods.
(1) optical flow-based
2017 - ICCV - "Flow-guided feature aggregation for video object detection"
2018 - CVPR - "Towards high performance video object detection"
(2) attention-based
2019 - ICCV - "Sequence level semantics aggregation for video object detection"
2019 - ICCV - "Relation distillation networks for video object detection"
2020 - CVPR - "Memory enhanced global-local aggregation for video object detection"
2021 - AAAI - "Temporal ROI align for video object recognition"
2021 - AAAI - "Mamba: Multi-level aggregation via memory bank for video object detection"
(3) tracking-based
2017 - ICCV - "Detect to track and track to detect"
2018 - "Integrated object detection and tracking with tracklet-conditioned detection"

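(1) optical flow-based
Deep Feature Flow was the first to introduce optical flow for image-level feature alignment, and FGFA uses optical flow to aggregate features along the motion path. Because image-level feature aggregation is computationally too expensive, attention-based methods were developed.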

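(2) attention-based
SELSA proposed a long-range feature aggregation method based on the semantic similarity of region-level features. Inspired by the relation module of Relation Networks, RDN captures the contextual relations of objects in space and time. MEGA designed a memory enhanced global-local aggregation module to better model the relations between objects. TROIA uses ROI-Align for fine-grained feature aggregation, and HVR-Net integrates intra-video and inter-video relations for further improvement. MAMBA enlarges the reference feature set with a memory bank. QueryProp speeds up VID with a lightweight query propagation module.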

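(3) tracking-based
D&T tackles VID in a tracking manner by building correlation maps between the features of different frames.

Finally, the paper notes that most of the methods above are built on two-stage detectors and are slow at inference.

Image-level feature aggregation appears to mean fusing features over the whole image, whereas the later attention-based methods appear to fuse features locally, for example only over region proposals, to cut the computation; computing optical flow is usually expensive as well.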

Methodology

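(1) Because the weights of the feature aggregation module are randomly initialised, aggregating features directly on the original classification branch and backpropagating would contaminate the pre-trained weights. Hence the extra branch, the vid_cls_feat in the earlier figure. The paper does not say why the regression branch does not get an extra branch in the same way.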

In practice, we found that directly aggregating the collected features in the classification branch and backpropagating the classification loss of the aggregated features will result in unstable training. Since the weight of the feature aggregation module is randomly initialized, finetuning all the networks from the beginning will contaminate the pre-trained weights. To address the above concerns, we fix the weights in the base detector except for the linear projection layers in detection head. We further insert two 3 × 3 convolutional (Conv) layers into the model neck as a new branch, called video object classification branch, which generates features for aggregation. Then, we feed the collected features from the video and regression branches into our feature aggregation module.

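(2) The same problem mentioned here, i.e. the homogeneity issue, is not entirely clear to me. My guess is that relying on similarity matches a degraded object with proposals that are equally degraded: for a dog that is hard to detect because of motion blur, the best matches are not the still, sharp dog in other frames but other similarly blurred objects. The paper seems to address this by separating the classification and regression similarities when guiding the fusion of V, in a way that differs slightly from YOLOV++ (I do not see why this would solve the problem). Finally, it mentions that positional encoding is not used over long temporal ranges.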

Simply referring to the cosine similarity will find features most similar to the target. However, when the target suffers from some degradation, the selected reference proposals using cosine similarity are very likely to have the same problem. We name this phenomenon the homogeneity issue.
…
To overcome the homogeneity issue, we further take predicted confidences from the raw detector into consideration, …
The positional information is not embedded, because the locations in a long temporal range would not be helpful as claimed in MEGA.