[Object Detection] [YOLOv4] YOLOv4: Optimal Speed and Accuracy of Object Detection


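0. Abstract

A large number of features are said to improve the accuracy of Convolutional Neural Networks (CNNs). Combinations of such features need practical testing on large datasets, and the results need theoretical justification. Some features work only for certain models and certain problems, or only on small-scale datasets, while others, such as batch normalization and residual connections, apply to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), and Mish activation. We use new features (WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss) and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100.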

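1. Introduction

The majority of CNN-based object detectors are largely applicable only to recommendation systems. For example, searching for free parking spaces via urban video cameras is done by slow but accurate models, whereas car collision warning relies on fast but less accurate models. Improving the accuracy of real-time object detectors makes it possible to use them not only for hint-generating recommendation systems, but also for stand-alone process management and for reducing human input. Real-time object detectors that run on conventional Graphics Processing Units (GPUs) can be deployed at scale at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address these problems by creating a CNN that operates in real time on a conventional GPU and whose training requires only one conventional GPU.

The main goal of this work is to design an object detector that runs fast in production systems and to optimize it for parallel computation, rather than to minimize the theoretical low-computation-volume indicator (BFLOP). We hope that the designed detector can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high-quality, and convincing object detection results, as the YOLOv4 results in Figure 1 show. Our contributions are summarized as follows: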

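We develop an efficient and powerful object detection model. It lets anyone use a 1080 Ti or 2080 Ti GPU to train a very fast and accurate object detector.
We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials object detection methods during detector training.
We modify state-of-the-art methods, including CBN [89], PAN [49], and SAM [85], to make them more efficient and suitable for single-GPU training.

Figure 1: Comparison of the proposed YOLOv4 with other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet at comparable performance. Compared with YOLOv3, AP and FPS improve by 10% and 12%, respectively.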

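2. Related work

2.1 Object detection models

A modern detector is usually composed of two parts: a backbone that is pre-trained on ImageNet, and a head that predicts object classes and bounding boxes. For detectors running on GPU platforms, the backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For detectors running on CPU platforms, the backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. Heads are usually divided into two kinds: one-stage object detectors and two-stage object detectors. The most representative two-stage object detector is the R-CNN [19] series, including Fast R-CNN [18], Faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. A two-stage object detector can also be made anchor-free, as in RepPoints [87]. For one-stage object detectors, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed, including CenterNet [13], CornerNet [37, 38], and FCOS [78]. Object detectors developed in recent years often insert some layers between the backbone and the head, and these layers are usually used to collect feature maps from different stages. We can call this part the neck of an object detector. A neck is usually composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include the Feature Pyramid Network (FPN) [44], the Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17].

In addition to the above models, some researchers focus on directly building a new backbone (DetNet [43], DetNAS [7]) or a completely new model (SpineNet [12], HitDetector [20]) for object detection.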

To sum up, an ordinary object detector is composed of several parts:
• Input: image, patches, image pyramid
• Backbone: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
• Neck:
• Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
• Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
• Head:
• Dense prediction (one-stage):
◦ RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor-based)
◦ CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor-free)
• Sparse prediction (two-stage):
◦ Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor-based)
◦ RepPoints [87] (anchor-free)

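2.2 Bag of freebies

A conventional object detector is usually trained offline. Researchers therefore like to take advantage of this and develop better training methods that give the detector higher accuracy without increasing the inference cost. We call these methods, which only change the training strategy or only increase the training cost, the "bag of freebies". Data augmentation is a technique that object detection methods often adopt and that meets this definition. Its purpose is to increase the variability of the input images so that the resulting object detection model is more robust to images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods, and they clearly benefit the object detection task. For photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.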

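The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers working on data augmentation focus on simulating object occlusion, and have achieved good results in image classification and object detection. For example, Random Erase [100] and CutOut [11] randomly select a rectangular region in an image and fill it with a random value or with zeros. Hide-and-Seek [69] and Grid Mask [6] randomly or evenly select multiple rectangular regions in an image and replace them with zeros. Applying similar concepts to feature maps gives DropOut [71], DropConnect [80], and DropBlock [16]. Some researchers have also proposed performing data augmentation with multiple images together. For example, MixUp [92] multiplies two images by different coefficient ratios, superimposes them, and then adjusts the label according to these ratios. CutMix [91] pastes a cropped image onto a rectangular region of another image and adjusts the label according to the size of the mixed area. Beyond these methods, style-transfer GANs [15] are also used for data augmentation, which can effectively reduce the texture bias learned by a CNN.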
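
The label adjustment in MixUp and CutMix can be illustrated with a minimal sketch. It assumes images are plain Python lists, labels are one-hot lists, and the mixing coefficient `lam` and the paste rectangle `box` are supplied by the caller; the function names are hypothetical and this is not the reference implementation of either method.

```python
def mixup(image_a, image_b, label_a, label_b, lam):
    """Blend two flat images and their one-hot labels with weight lam."""
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


def cutmix(image_a, image_b, label_a, label_b, box):
    """Paste a rectangle of image_b onto a copy of image_a.

    Images are 2-D lists of equal size. box = (x0, y0, x1, y1) uses
    half-open pixel coordinates (an assumption of this sketch).
    """
    x0, y0, x1, y1 = box
    height, width = len(image_a), len(image_a[0])
    mixed = [row[:] for row in image_a]  # leave the inputs untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            mixed[y][x] = image_b[y][x]
    # The label weight follows the fraction of pixels kept from image_a.
    lam = 1.0 - ((x1 - x0) * (y1 - y0)) / (height * width)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label
```

For detection, the bounding boxes of both images would also have to be carried over or clipped; that step is omitted here.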

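Unlike the approaches above, some other bag-of-freebies methods are dedicated to the problem that the semantic distribution in the dataset may be biased. One important aspect of this is data imbalance between classes, which is often handled in two-stage object detectors by hard negative example mining [72] or online hard example mining [67]. Example mining does not apply to one-stage object detectors, because this kind of detector belongs to the dense prediction architecture. Lin et al. [45] therefore proposed focal loss to deal with data imbalance among classes. Another important issue is that a one-hot hard label has difficulty expressing the degree of association between categories, and this representation is what is typically used during labeling. Label smoothing, proposed in [73], converts hard labels into soft labels for training, which can make the model more robust. To obtain better soft labels, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.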
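
As a rough illustration of the two ideas above, the sketch below writes label smoothing and the binary focal loss in their commonly cited forms. The default values `epsilon=0.1`, `gamma=2.0`, and `alpha=0.25` are assumptions chosen for the example, not settings taken from this article.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Turn a hard one-hot label into a soft label (uniform smoothing)."""
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p in (0, 1) and target y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=1` the focal loss of a positive example reduces to ordinary cross-entropy, which is a quick way to sanity-check an implementation.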

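The last bag of freebies is the objective function of bounding box (BBox) regression. A traditional object detector usually uses Mean Square Error (MSE) to regress directly on the center point coordinates and the height and width of the BBox, i.e., {x_center, y_center, w, h}, or on the upper-left and lower-right points, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. Anchor-based methods estimate the corresponding offsets instead, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, directly estimating the coordinate values of each point of the BBox treats these points as independent variables and does not consider the integrity of the object itself. To handle this better, some researchers proposed IoU loss [90], which takes into account the coverage of the predicted BBox area and the ground-truth BBox area. Computing the IoU loss triggers the calculation of the four coordinate points of the BBox by evaluating the IoU against the ground truth, and the generated results are then joined into a single whole. Because IoU is a scale-invariant representation, it avoids the problem that the l1 or l2 loss on {x, y, w, h} grows with scale. Researchers have continued to improve IoU loss. For example, GIoU loss [65] includes the shape and orientation of the object in addition to the coverage area. It finds the smallest BBox that covers both the predicted BBox and the ground-truth BBox, and uses this BBox as the denominator in place of the one originally used in IoU loss. DIoU loss [99] additionally considers the distance between object centers, while CIoU loss [99] considers the overlapping area, the distance between center points, and the aspect ratio at the same time. CIoU achieves better convergence speed and accuracy on the BBox regression problem.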
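
The relationship between these losses is easier to see side by side. The sketch below computes IoU, GIoU, DIoU, and CIoU for two axis-aligned boxes, following the formulas as they are commonly stated in [65] and [99]; the `(x1, y1, x2, y2)` box format, the function name, and the small `eps` stabilizer are assumptions of this example. The corresponding loss in each case is `1 - value`.

```python
import math


def _area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_family(pred, truth, eps=1e-9):
    """Return IoU, GIoU, DIoU and CIoU for two (x1, y1, x2, y2) boxes."""
    inter = _area((max(pred[0], truth[0]), max(pred[1], truth[1]),
                   min(pred[2], truth[2]), min(pred[3], truth[3])))
    union = _area(pred) + _area(truth) - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both boxes.
    ex1, ey1 = min(pred[0], truth[0]), min(pred[1], truth[1])
    ex2, ey2 = max(pred[2], truth[2]), max(pred[3], truth[3])
    enclose_area = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose_area - union) / (enclose_area + eps)

    # Squared center distance over squared enclosing-box diagonal.
    dx = (pred[0] + pred[2]) / 2.0 - (truth[0] + truth[2]) / 2.0
    dy = (pred[1] + pred[3]) / 2.0 - (truth[1] + truth[3]) / 2.0
    diagonal_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - (dx * dx + dy * dy) / (diagonal_sq + eps)

    # Aspect-ratio consistency term used by CIoU.
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_t, h_t = truth[2] - truth[0], truth[3] - truth[1]
    v = (4.0 / math.pi ** 2) * (math.atan(w_t / (h_t + eps))
                                - math.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    ciou = diou - alpha * v
    return {"iou": iou, "giou": giou, "diou": diou, "ciou": ciou}
```

For two 2×2 boxes offset by one unit diagonally, IoU is 1/7 while GIoU is negative, which shows how the enclosing-box term penalizes poorly aligned boxes even when they overlap.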

2.3 Bag of specials

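We call plugin modules and post-processing methods that increase the inference cost only slightly but significantly improve the accuracy of object detection the "bag of specials". Generally speaking, these plugin modules enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration, while post-processing is a method for screening the model's prediction results.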

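Common modules for enlarging the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39]. The original SPM method splits the feature map into several d × d equal blocks, where d can be {1, 2, 3, …}, forming a spatial pyramid, and then extracts bag-of-words features. SPP integrates SPM into a CNN and uses a max-pooling operation instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied directly in a Fully Convolutional Network (FCN). In the design of YOLOv3 [63], Redmon and Farhadi therefore changed the SPP module into a concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone features. After adding this improved SPP module, YOLOv3-608 gains 2.7% AP50 on the MS COCO object detection task at the cost of 0.5% extra computation. The main difference between the ASPP [5] module and the improved SPP module is that ASPP replaces the original k × k, stride-1 max-pooling with several 3 × 3 dilated convolutions with dilation rate k and stride 1. The RFB module uses several dilated convolutions with k × k kernels, dilation rate k, and stride 1 to obtain more comprehensive spatial coverage than ASPP. RFB [47] costs only 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.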
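
A minimal sketch of the improved SPP block described above: stride-1 max pooling with kernel sizes {1, 5, 9, 13}, with the outputs stacked along the channel axis. It assumes a single-channel feature map stored as a 2-D Python list and handles borders by ignoring out-of-range cells (equivalent to padding with negative infinity); real implementations operate on multi-channel tensors.

```python
def max_pool_same(feature, k):
    """k x k max pooling with stride 1 that keeps the spatial size."""
    height, width = len(feature), len(feature[0])
    r = k // 2
    return [[max(feature[yy][xx]
                 for yy in range(max(0, y - r), min(height, y + r + 1))
                 for xx in range(max(0, x - r), min(width, x + r + 1)))
             for x in range(width)]
            for y in range(height)]


def spp(feature, kernel_sizes=(1, 5, 9, 13)):
    """Return one pooled map per kernel size, i.e. the concatenated channels."""
    return [max_pool_same(feature, k) for k in kernel_sizes]
```

Because the stride is 1 and the spatial size is preserved, the block can be dropped into a fully convolutional network; the output simply has four times as many channels as the input.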

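The attention modules often used in object detection are mainly divided into channel-wise attention and point-wise attention, represented by Squeeze-and-Excitation (SE) [29] and the Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve the top-1 accuracy of ResNet50 on the ImageNet image classification task by 1% at the cost of only 2% more computation, on a GPU it usually increases the inference time by about 10%, so it is more appropriate for mobile devices. SAM needs only 0.1% extra computation and improves the top-1 accuracy of ResNet50-SE by 0.5% on the ImageNet image classification task. Best of all, it does not affect inference speed on the GPU at all.

In terms of feature integration, the early practice was to use skip connections [51] or hypercolumns [22] to integrate low-level physical features with high-level semantic features. Since multi-scale prediction methods such as FPN became popular, many lightweight modules that integrate different feature pyramids have been proposed, including SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to perform channel-wise re-weighting on multi-scale concatenated feature maps. ASFF uses softmax for point-wise re-weighting and then adds feature maps of different scales. BiFPN proposes multi-input weighted residual connections to perform scale-wise re-weighting and then adds feature maps of different scales.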

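In deep learning research, some people focus on finding a good activation function. A good activation function lets the gradient propagate more efficiently without adding much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the vanishing gradient problem that is frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55] were proposed, and they are also used to address the vanishing gradient problem. The main purpose of LReLU and PReLU is to fix the zero gradient of ReLU when the output is less than zero. ReLU6 and hard-Swish are specially designed for quantized networks. SELU was proposed to make a neural network self-normalizing. Note that both Swish and Mish are continuously differentiable activation functions.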
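
For reference, the scalar definitions of several of these activations can be written down directly. This is a minimal sketch using the standard formulas (Swish as `x * sigmoid(x)`, Mish as `x * tanh(softplus(x))`); the leaky slope of 0.1 is an assumed example value.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    # Keeps a small non-zero gradient for negative inputs.
    return x if x > 0 else slope * x


def sigmoid(x):
    # Numerically stable for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    return x * sigmoid(x)


def softplus(x):
    # log(1 + exp(x)) without overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))
```

Unlike ReLU, both Swish and Mish are smooth at zero and produce small negative outputs for moderately negative inputs.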

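The post-processing method commonly used in deep-learning-based object detection is NMS (non-maximum suppression), which filters out BBoxes that poorly predict the same object and retains only the candidate BBoxes with a higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original NMS does not consider context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference and performed greedy NMS in order of confidence score, from high to low. Soft NMS [1] considers the problem that object occlusion may degrade the confidence score in greedy NMS based on the IoU score. The developers of DIoU NMS [99] add the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.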
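
The greedy procedure and the DIoU variant can be sketched as follows. This is an illustrative implementation under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the threshold of 0.5 is an example value, and the `use_diou` flag (a hypothetical name) switches the suppression criterion from plain IoU to IoU minus the normalized center-distance penalty.

```python
def _iou_and_center_penalty(a, b, eps=1e-9):
    """Return (IoU, squared center distance / squared enclosing diagonal)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    dx = (a[0] + a[2] - b[0] - b[2]) / 2.0
    dy = (a[1] + a[3] - b[1] - b[3]) / 2.0
    ew = max(a[2], b[2]) - min(a[0], b[0])
    eh = max(a[3], b[3]) - min(a[1], b[1])
    return inter / (union + eps), (dx * dx + dy * dy) / (ew * ew + eh * eh + eps)


def nms(boxes, scores, threshold=0.5, use_diou=False):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        remaining = []
        for i in order:
            iou, penalty = _iou_and_center_penalty(boxes[best], boxes[i])
            overlap = iou - penalty if use_diou else iou
            if overlap <= threshold:  # not suppressed by the kept box
                remaining.append(i)
        order = remaining
    return keep
```

With the DIoU criterion, two boxes that overlap strongly but have clearly separated centers are more likely to both survive, which is the intended benefit for occluded objects.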

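3. Methodology

The basic aim is fast operating speed of the neural network in production systems and optimization for parallel computation, rather than a low theoretical computation-volume indicator (BFLOP). We present two options for real-time neural networks:
• For GPU, we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
• For VPU, we use grouped convolution but refrain from using Squeeze-and-Excitation (SE) blocks. Specifically, this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3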

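3.1 Selection of architecture

Our objective is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size² * filters * channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that CSPResNeXt50 is considerably better than CSPDarknet53 for object classification on the ILSVRC2012 (ImageNet) dataset [10]. Conversely, CSPDarknet53 is better than CSPResNeXt50 for object detection on the MS COCO dataset [46].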
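
The parameter-count expression quoted above can be evaluated directly. The helper below is a small sketch (its name is hypothetical) that counts only convolution weights, exactly as the formula does, and ignores biases and normalization parameters.

```python
def conv_params(filter_size, filters, channels, groups=1):
    """Weights of one conv layer: filter_size^2 * filters * channels / groups."""
    if channels % groups or filters % groups:
        raise ValueError("channels and filters must be divisible by groups")
    return filter_size * filter_size * filters * channels // groups
```

For example, a 3×3 layer with 32 input channels and 64 filters has 18,432 weights with `groups=1` and 2,304 with `groups=8`, which is why grouped convolution reduces parameters and theoretical computation.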

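The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels, e.g. FPN, PAN, ASFF, BiFPN.

A reference model that is optimal for classification is not always optimal for a detector. In contrast to a classifier, a detector requires the following:
• Higher input network size (resolution): for detecting multiple small-sized objects
• More layers: for a larger receptive field to cover the increased input network size
• More parameters: for greater capacity of the model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of 3×3 convolutional layers) and a larger number of parameters should be selected as the backbone. Table 1 shows the information for CSPResNeXt50, CSPDarknet53, and EfficientNet B3. CSPResNeXt50 contains only 16 3×3 convolutional layers, a 425×425 receptive field, and 20.6 M parameters, while CSPDarknet53 contains 29 3×3 convolutional layers, a 725×725 receptive field, and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that CSPDarknet53 is the optimal model of the two as the backbone for a detector.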


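The influence of receptive fields of different sizes is summarized as follows:
• Up to the object size: allows viewing the entire object
• Up to the network size: allows viewing the context around the object
• Exceeding the network size: increases the number of connections between an image point and the final activation

We add the SPP block over CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features, and causes almost no reduction in network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose the CSPDarknet53 backbone, the SPP additional module, the PANet path-aggregation neck, and the YOLOv3 (anchor-based) head as the architecture of YOLOv4.

In the future we plan to significantly expand the content of the Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase detector accuracy, and to check the influence of each feature sequentially in an experimental fashion.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art results on a conventional graphics processor such as a GTX 1080Ti or RTX 2080Ti.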

3.2 Selection of BoF and BoS

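To improve object detection training, a CNN usually uses the following:
• Activations: ReLU, Leaky-ReLU, parametric ReLU, ReLU6, SELU, Swish, or Mish
• Bounding box regression loss: MSE, IoU, GIoU (generalized IoU), CIoU (complete IoU), DIoU (distance IoU)
• Data augmentation: CutOut, MixUp, CutMix
• Regularization methods: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
• Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
• Skip connections: residual connections, weighted residual connections, multi-input weighted residual connections, or Cross-Stage-Partial connections (CSP)

As for the training activation function, since PReLU and SELU are more difficult to train and ReLU6 is specifically designed for quantized networks, we remove these three activation functions from the candidate list. For regularization, the authors of DropBlock have compared their method with other methods in detail, and their regularization method came out well ahead. We therefore did not hesitate to choose DropBlock as our regularization method. As for normalization, since we focus on a training strategy that uses only one GPU, SyncBN is not considered.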

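3.3 Additional improvements

To make the designed detector more suitable for training on a single GPU, we made the following additional designs and improvements:
• We introduce a new data augmentation method, Mosaic, and Self-Adversarial Training (SAT)
• We select optimal hyperparameters by applying genetic algorithms
• We modify some existing methods to make our design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)

Mosaic is a new data augmentation method that mixes 4 training images. Four different contexts are therefore mixed, whereas CutMix mixes only 2 input images. This allows objects to be detected outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.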
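
To make the idea concrete, here is a deliberately simplified sketch of a four-image mosaic. It assumes four equally sized images stored as 2-D lists and a caller-supplied split point `center`; each quadrant of the output is copied from the matching region of one source image. Real pipelines also rescale and crop each image randomly and remap the bounding-box labels, which this sketch leaves out.

```python
def mosaic(images, center):
    """Compose four equally sized 2-D images around a split point.

    images = (top_left, top_right, bottom_left, bottom_right)
    center = (cx, cy): column and row where the four quadrants meet.
    """
    top_left, top_right, bottom_left, bottom_right = images
    height, width = len(top_left), len(top_left[0])
    cx, cy = center
    canvas = []
    for y in range(height):
        row = []
        for x in range(width):
            if y < cy:
                source = top_left if x < cx else top_right
            else:
                source = bottom_left if x < cx else bottom_right
            row.append(source[y][x])
        canvas.append(row)
    return canvas
```

In training, `center` would typically be drawn at random for every sample so that the relative size of the four contexts varies.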

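Self-Adversarial Training (SAT) is also a new data augmentation technique. It operates in two forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the second stage, the neural network is trained in the normal way to detect an object in this modified image.

CmBN is a modified version of CBN, as shown in Figure 4, and is defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between the mini-batches within a single batch.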
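
One simplified way to read "collecting statistics between mini-batches within a single batch" is sketched below. It is an interpretation for illustration only, not the authors' implementation: it treats one scalar feature, accumulates running sums over the mini-batches of the current batch, normalizes each mini-batch with the statistics accumulated so far, and omits the learnable scale and shift as well as the weight-change compensation used by CBN. The function name is hypothetical.

```python
import math


def cmbn_normalize(mini_batches, eps=1e-5):
    """Normalize the mini-batches of ONE batch with accumulated statistics.

    mini_batches: list of lists of floats (one scalar feature per sample).
    Call it once per batch, so statistics never cross a batch boundary.
    """
    count, total, total_sq = 0, 0.0, 0.0
    outputs = []
    for mini_batch in mini_batches:
        count += len(mini_batch)
        total += sum(mini_batch)
        total_sq += sum(v * v for v in mini_batch)
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        outputs.append([(v - mean) / math.sqrt(var + eps) for v in mini_batch])
    return outputs
```

The later mini-batches of a batch therefore see statistics computed from more samples than a plain per-mini-batch BN would provide.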

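We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figure 5 and Figure 6, respectively.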


3.4 YOLOv4

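In this section, we elaborate on the details of YOLOv4. YOLOv4 consists of:

Backbone: CSPDarknet53 [81]
Neck: SPP [25], PAN [49]
Head: YOLOv3 [63]

YOLOv4 uses:

Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing
Bag of Specials (BoS) for backbone: Mish activation, Cross-Stage-Partial connections (CSP), Multi-input weighted residual connections (MiWRC)
Bag of Freebies (BoF) for detector: CIoU loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, eliminating grid sensitivity, using multiple anchors for a single ground truth, cosine annealing scheduler [52], optimal hyperparameters, random training shapes
Bag of Specials (BoS) for detector: Mish activation, SPP block, SAM block, PAN path-aggregation block, DIoU-NMS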

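4. Experiments

We test the influence of different training improvement techniques on the accuracy of the classifier on the ImageNet (ILSVRC 2012 val) dataset, and then on the accuracy of the detector on the MS COCO (test-dev 2017) dataset.

4.1 Experimental setup

In the ImageNet image classification experiments, the default hyperparameters are as follows: the number of training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; a polynomial decay learning rate schedule is adopted with an initial learning rate of 0.1; the number of warm-up steps is 1000; the momentum and weight decay are set to 0.9 and 0.005, respectively. All of our BoS experiments use the same hyperparameters as the default setting, and in the BoF experiments we add an additional 50% of training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, blurring data augmentation, and label smoothing regularization. In the BoS experiments, we compare the effects of the LReLU, Swish, and Mish activation functions. All experiments are trained on a 1080 Ti or 2080 Ti GPU.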

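In the MS COCO object detection experiments, the default hyperparameters are as follows: the number of training steps is 500,500; a step decay learning rate schedule is adopted with an initial learning rate of 0.01, multiplied by a factor of 0.1 at step 400,000 and again at step 450,000; the momentum and weight decay are set to 0.9 and 0.0005, respectively. All architectures use a single GPU to perform multi-scale training with a batch size of 64, while the mini-batch size is 8 or 4 depending on the architecture and the GPU memory limit. Except for the hyperparameter search experiments that use genetic algorithms, all other experiments use the default setting. The genetic algorithm used YOLOv3-SPP trained with GIoU loss and searched for 300 epochs on the min-val 5k set. For the genetic algorithm experiments we adopt the searched learning rate of 0.00261, momentum of 0.949, IoU threshold for assigning ground truth of 0.213, and loss normalizer of 0.07. We have verified a large number of BoF methods, including grid sensitivity elimination, Mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, self-adversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, optimized anchors, and different kinds of IoU losses. We also conduct experiments on various BoS methods, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments, we use only one GPU for training, so techniques such as SyncBN that optimize for multiple GPUs are not used.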
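
The step decay schedule described above, and the cosine annealing scheduler listed among the BoF methods, can be sketched as simple functions of the training step. The defaults mirror the numbers quoted in the text (0.01, steps 400,000 and 450,000, 500,500 total steps); the function names, the `min_lr=0.0` floor, and the single-cycle form of the cosine schedule without restarts are assumptions of this example.

```python
import math


def step_decay_lr(step, base_lr=0.01, milestones=(400_000, 450_000), factor=0.1):
    """Multiply the learning rate by `factor` at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= factor
    return lr


def cosine_annealing_lr(step, total_steps=500_500, base_lr=0.01, min_lr=0.0):
    """Single-cycle cosine annealing from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The step schedule keeps the rate constant for most of training and drops it sharply twice, while the cosine schedule decreases it smoothly from the first step to the last.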

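4.2 Influence of different features on classifier training

First, we study the influence of different features on classifier training: specifically, the influence of class label smoothing, the influence of different data augmentation techniques (bilateral blurring, MixUp, CutMix, and Mosaic, as shown in Figure 7), and the influence of different activations (Leaky-ReLU by default, Swish, and Mish).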


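In our experiments, as illustrated in Table 2, the classifier's accuracy improves by introducing features such as CutMix and Mosaic data augmentation, class label smoothing, and Mish activation. As a result, our BoF-backbone (Bag of Freebies) for classifier training includes CutMix and Mosaic data augmentation and class label smoothing. In addition, we use Mish activation as a complementary option, as shown in Table 2 and Table 3.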


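4.3 Influence of different features on detector training

Further study concerns the influence of different Bag-of-Freebies (BoF-detector) on detector training accuracy, as shown in Table 4. We significantly expand the BoF list by studying different features that increase detector accuracy without affecting FPS:
• S: Eliminate grid sensitivity. In YOLOv3 the equations b_x = σ(t_x) + c_x and b_y = σ(t_y) + c_y, where c_x and c_y are always whole numbers, are used to evaluate the object coordinates, so extremely high absolute values of t_x are required for b_x to approach c_x or c_x + 1. We solve this problem by multiplying the sigmoid by a factor exceeding 1.0, which eliminates the effect of the grid on which the object is undetectable.
• M: Mosaic data augmentation. Uses a 4-image mosaic during training instead of a single image.
• IT: IoU threshold. Uses multiple anchors for a single ground truth when IoU(truth, anchor) > IoU threshold.
• GA: Genetic algorithms. Uses genetic algorithms to select the optimal hyperparameters during the first 10% of the network training time.
• LS: Class label smoothing. Uses class label smoothing for the sigmoid activation.
• CBN: CmBN. Uses Cross mini-Batch Normalization to collect statistics inside the entire batch, instead of collecting statistics inside a single mini-batch.
• CA: Cosine annealing scheduler. Alters the learning rate along a sinusoidal curve during training.
• DM: Dynamic mini-batch size. Automatically increases the mini-batch size during small-resolution training by using random training shapes.
• OA: Optimized anchors. Uses the optimized anchors for training with the 512x512 network resolution.
• GIoU, CIoU, DIoU, MSE: Uses different loss algorithms for bounding box regression.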
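
The grid-sensitivity item (S) can be illustrated with a small decoding sketch. The text only states that the sigmoid is multiplied by a factor above 1.0; the re-centering term `-(scale - 1) / 2`, which keeps the output centered in the cell, is an assumption of this example, as are the function names and the sample factor 1.2.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_center(t, cell, scale=1.0):
    """Decode a box-center coordinate from raw output t and cell index `cell`.

    scale=1.0 gives the YOLOv3 form b = sigmoid(t) + c. With scale > 1.0
    the scaled sigmoid is re-centered (an assumption of this sketch) so the
    prediction can reach the cell borders at moderate values of t.
    """
    return scale * sigmoid(t) - 0.5 * (scale - 1.0) + cell
```

With `scale=1.0`, even `t = -10` yields a value strictly greater than the cell boundary, whereas with `scale=1.2` the same `t` already lands slightly outside the cell, so boundary positions no longer require extreme logits.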

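Further study concerns the influence of different Bag-of-Specials (BoS-detector) on detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, as shown in Table 5. In our experiments, the detector performs best when using SPP, PAN, and SAM.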


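4.4 Influence of different backbones and pretrained weights on detector training

Next, we study the influence of different backbone models on detector accuracy, as shown in Table 6. We notice that the model with the best classification accuracy is not always the best in terms of detector accuracy.

First, although the classification accuracy of CSPResNeXt50 models trained with different features is higher than that of CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in object detection.

Second, using BoF and Mish for CSPResNeXt50 classifier training increases its classification accuracy, but applying these pre-trained weights to detector training reduces detector accuracy. In contrast, using BoF and Mish for CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector that uses the classifier's pre-trained weights. The net result is that the CSPDarknet53 backbone is more suitable for the detector than CSPResNeXt50.

We observe that the CSPDarknet53 model demonstrates a greater ability to increase detector accuracy through the various improvements.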


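4.5 Influence of different mini-batch sizes on detector training

Finally, we analyze the results obtained with models trained with different mini-batch sizes, shown in Table 7. The results in Table 7 show that after adding the BoF and BoS training strategies, the mini-batch size has almost no effect on the detector's performance. This indicates that, once BoF and BoS are introduced, expensive GPUs are no longer needed for training. In other words, anyone can use a conventional GPU to train an excellent detector.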


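5. Results and conclusions

A comparison with the results obtained by other state-of-the-art object detectors is shown in Figure 8. Our YOLOv4 lies on the Pareto optimality curve and is superior to the fastest and most accurate detectors in terms of both speed and accuracy.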

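Since different methods use GPUs of different architectures to verify inference time, we run YOLOv4 on commonly adopted GPUs of the Maxwell, Pascal, and Volta architectures and compare it with other state-of-the-art methods. Table 8 lists the frame rate comparison on Maxwell GPUs, which can be a GTX Titan X (Maxwell) or a Tesla M40. Table 9 lists the frame rate comparison on Pascal GPUs, which can be a Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100. Table 10 lists the frame rate comparison on Volta GPUs, which can be a Titan Volta or a Tesla V100.

We offer a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50…95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8-16 GB of VRAM, which makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features and selected some of them to improve the accuracy of both the classifier and the detector. These features can serve as best practice for future studies and developments.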


6. References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5561–5569, 2017. 4
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018. 12
[3] Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hierarchical shot detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9705–9714, 2019. 12
[4] Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traffic network. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. 13
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2017. 2, 4
[6] Pengguang Chen. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020. 3
[7] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Processing Systems (NeurIPS), pages 6638–6648, 2019. 2
[8] Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 502–511, 2019. 7
[9] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems (NIPS), pages 379–387, 2016. 2
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. 5
[11] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with CutOut. arXiv preprint arXiv:1708.04552, 2017. 3
[12] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. arXiv preprint arXiv:1912.05027, 2019. 2
[13] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6569–6578, 2019. 2, 12
[14] Cheng-Yang Fu, Mykhailo Shvets, and Alexander C Berg. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019. 12
[15] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019. 3
[16] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems (NIPS), pages 10727–10737, 2018. 3
[17] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7036–7045, 2019. 2, 13
[18] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 2
[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014. 2, 4
[20] Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. HitDetector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2
[21] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 5
[22] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015. 4
[23] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 4
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015. 2, 4, 7
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 2
[27] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. 2, 4
[28] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2, 4
[29] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141, 2018. 4
[30] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017. 2
[31] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016. 2
[32] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 6
[33] Md Amirul Islam, Shujon Naha, Mrigank Rochan, Neil Bruce, and Yang Wang. Label refinement network for coarse-to-fine semantic segmentation. arXiv preprint arXiv:1703.00551, 2017. 3
[34] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018. 11
[35] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017. 4
[36] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016. 6
[37] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018. 2, 11
[38] Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019. 2
[39] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178. IEEE, 2006. 4
[40] Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 12, 13
[41] Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6609–6618, 2019. 12
[42] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6054–6063, 2019. 12
[43] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018. 2
[44] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 2
[45] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 2, 3, 11, 13
[46] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014. 5
[47] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018. 2, 4, 11
[48] Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019. 2, 4, 13
[49] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018. 1, 2, 7
[50] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016. 2, 11
[51] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. 4
[52] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 7
[53] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNetV2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 2
[54] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of International Conference on Machine Learning (ICML), volume 30, page 3, 2013. 4
[55] Diganta Misra. Mish: A self regularized nonmonotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019. 4
[56] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML), pages 807–814, 2010. 4
[57] Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9537–9546, 2019. 12
[58] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 821–830, 2019. 2, 12
[59] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 4
[60] Abdullah Rashwan, Agastya Kalra, and Pascal Poupart. Matrix Nets: A new deep architecture for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCV Workshop), pages 0–0, 2019. 2
[61] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 2
[62] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–7271, 2017. 2
[63] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 2, 4, 7, 11
[64] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015. 2
[65] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019. 3
[66] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 2
[67] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016. 3
[68] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2
[69] Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, and Yong Jae Lee. Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv preprint arXiv:1811.02545, 2018. 3
[70] Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019. 6
[71] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. DropOut: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. 3
[72] K-K Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(1):39–51, 1998. 3
[73] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 3
[74] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828, 2019. 2
[75] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019. 2
[76] Mingxing Tan and Quoc V Le. MixNet: Mixed depthwise convolutional kernels. In Proceedings of the British Machine Vision Conference (BMVC), 2019. 5
[77] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4, 13
[78] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019. 2
[79] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015. 6
[80] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of International Conference on Machine Learning (ICML), pages 1058–1066, 2013. 3
[81] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of cnn. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPR Workshop), 2020. 2, 7
[82] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2965–2974, 2019. 12
[83] Shaoru Wang, Yongchao Gong, Junliang Xing, Lichao Huang, Chang Huang, and Weiming Hu. RDSNet: A new deep architecture for reciprocal object detection and instance segmentation. arXiv preprint arXiv:1912.05070, 2019. 13
[84] Tiancai Wang, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Learning rich features at high-speed for single-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1971–1980, 2019. 11
[85] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018. 1, 2, 4
[86] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017. 2
[87] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9657–9666, 2019. 2, 12
[88] Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. SM-NAS: Structural-to-modular neural architecture search for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 13
[89] Zhuliang Yao, Yue Cao, Shuxin Zheng, Gao Huang, and Stephen Lin. Cross-iteration batch normalization. arXiv preprint arXiv:2002.05712, 2020. 1, 6
[90] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pages 516–520, 2016. 3
[91] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6023–6032, 2019. 3
[92] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. MixUp: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 3
[93] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7151–7160, 2018. 6
[94] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 13
[95] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4203–4212, 2018. 11
[96] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 12
[97] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018. 2
[98] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 9259–9266, 2019. 2, 4, 11
[99] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU Loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 3, 4
[100] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017. 3
[101] Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, and Marios Savvides. Soft anchor-point object detection. arXiv preprint arXiv:1911.12448, 2019. 12
[102] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 840–849, 2019. 11