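Summary of YOLOv1-v5

Contents

1. YOLOv1
- 1.2 Prediction stage
- 1.3 Post-processing stage
- 1.4 Model training stage
- 1.5 Loss function
- 1.6 YOLOv1 network
- 1.7 Why use 1*1 convolutions?
2. YOLOv2
- 2.1 Adding BN layers
- 2.2 High-resolution classifier (classification only, not object detection)
- 2.3 Anchor mechanism
- 2.4 About anchor boxes: how positions are determined
- 2.5 Fine-grained features
- 2.6 Multi-scale input
- 2.7 Loss function
- 2.8 Why BN and dropout should not be used together
- 2.8 YOLOv2 network
3. YOLOv3 summary
- 3.1 Network
- 3.2 Getting results from the feature layers
- 3.3 Decoding the predictions
- 3.4 Loss computation
- 3.5 Improvement 1
- 3.6 Improvement 2
- 3.7 Darknet-53 summary
- 3.8 Matching positive and negative samples
4. YOLOv4 summary
- 4.1 Network
- 4.2 How the prior boxes are computed
- 4.3 CIOU computation
YOLOv5 summary
- Matching positive and negative samples
3. Pooling layers
4. Residual neural networks
IOU
GIOU
How DIoU works
How CIoU works
Recall and precision
softmax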


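1. YOLOv1

Two-stage versus one-stage models
One-stage: no region proposals are extracted. The whole image is fed to the network, which directly outputs the detection results.
Two-stage: a set of region proposals is first extracted from the image, then each proposal is classified in turn to produce the results.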

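1.2 Prediction stage

Once the model is trained, an unseen image is fed in for prediction. The input is 448 * 448 * 3 and the output is a 7 * 7 * 30 tensor. The final predictions are contained in this tensor, and NMS is needed to filter out duplicate boxes.

How the 7 * 7 * 30 tensor is composed: YOLOv1 divides the input image into S * S grid cells (S = 7).
Each grid cell predicts B bounding boxes (B = 2), and the centers of both predicted boxes fall inside that grid cell.
Each bounding box carries 4 position values (x, y, w, h) plus a confidence, and each grid cell additionally outputs C class probabilities (C = 20).
These class probabilities are conditional: given that the cell contains an object, the probability that the object belongs to each class.
Bounding-box confidence * conditional class probability = class-specific score of that box.

Layout of the 30 values: x, y, w, h, confidence, x, y, w, h, confidence, followed by 20 class probabilities (class-specific score = confidence * class probability).

x, y: offset of the predicted box center relative to the grid cell.
w, h: width and height of the predicted box as a fraction of the full image width and height.
Confidence: P(object) * IOU(predicted box, ground truth), where P(object) = 1 if the box contains an object and 0 otherwise.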
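
The 30-value layout can be checked with a minimal sketch. The function name `parse_cell` and the sample numbers are hypothetical and chosen only for illustration; the slicing follows the ordering listed above (two boxes of five values each, then 20 class probabilities).

```python
def parse_cell(cell, num_boxes=2, num_classes=20):
    """Split one grid cell's 30 values into boxes, class probabilities and scores."""
    boxes = []
    for b in range(num_boxes):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    start = num_boxes * 5
    class_probs = cell[start:start + num_classes]
    # class-specific score = box confidence * conditional class probability
    scores = [[box["conf"] * p for p in class_probs] for box in boxes]
    return boxes, class_probs, scores


# Hypothetical cell: box 0 is confident (0.9), box 1 is not (0.2),
# and class index 11 has conditional probability 0.8.
cell = [0.5, 0.5, 0.2, 0.3, 0.9, 0.4, 0.6, 0.1, 0.1, 0.2] + [0.0] * 20
cell[10 + 11] = 0.8
boxes, class_probs, scores = parse_cell(cell)
```

Each cell therefore yields 2 score vectors of length 20, which is where the 7 * 7 * 2 = 98 vectors used in post-processing come from.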

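1.3 Post-processing stage

The 7 * 7 * 30 output tensor is filtered so that only the best box for each object survives. The 7 * 7 * 2 = 98 predicted boxes each have a vector of length 20 (one entry per class), and processing runs class by class. For the first class, multiply the class probability by the box confidence to get each box's class-specific score, set every score below the confidence threshold to 0, and sort the 98 scores in descending order. Compute the IOU between the highest-scoring box and the second one; if it exceeds the IOU threshold, the two boxes predict the same object and the lower score is set to 0. Repeat against the third box, the fourth, and so on down to the last. Then take the next surviving box as the reference and compare it with every box after it in the same way. Repeat the whole procedure until all 20 classes have been processed.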
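
The per-class filtering described above can be sketched in plain Python. This is a minimal illustration, not the original YOLO implementation: the function names, the corner box format (x1, y1, x2, y2), and the thresholds 0.2 and 0.5 are assumptions.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_one_class(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Return the indices kept for one class, highest score first."""
    # Drop low scores, then sort the rest in descending order.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress every remaining box that overlaps the current best too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms_one_class(boxes, scores)  # boxes 0 and 2 survive
```

Running this once per class, on the class-specific scores of all 98 boxes, reproduces the loop described in the text.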

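1.4 Model training stage

The image is divided into 7 * 7 grid cells and each cell predicts 2 bounding boxes. The grid cell that contains the center of a ground-truth box is responsible for that object, and within that cell the predicted box with the larger IOU against the ground truth is the one that predicts it. If no ground-truth center falls in a grid cell, neither of its two boxes is responsible for anything: the only training signal for them is to push their confidence toward 0.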

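1.5 Loss function

Design goal: the box responsible for an object should match that object's ground-truth box as closely as possible, which is what minimizing the loss achieves. Training is supervised learning: the predictions are compared with the human annotations and the gap between them is driven down.
Every term uses a sum-squared error loss.

1. Center localization error of the bounding box responsible for the object:
The center of the responsible predicted box should be as close as possible to the center of the ground-truth box. It is measured as a sum of squared residuals.

2. Width and height error of the bounding box responsible for the object:
The width and height of the responsible box should match those of the ground-truth box, again as a sum of squared residuals. The square root is taken so that small objects are more sensitive: the same absolute deviation costs more for a small box than for a large one.

3. Confidence error of the bounding box responsible for the object:
The label is the IOU between this bounding box and the ground-truth box. The closer the prediction is to this value, the better.

4. Confidence error of the boxes not responsible for any object:
For every bounding box that is not responsible for an object the target confidence is 0, so the closer to 0 the better.

5. Classification error of the grid cells responsible for an object:
For example, if a grid cell is responsible for a dog, then among its 20 conditional class probabilities the one for dog should be as close to 1 as possible.
The squared error is summed over all classes: the ground-truth class should approach 1 and the others 0.
This term is computed only for grid cells that contain a real object. Cells without an object are skipped.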
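
For reference, the five terms above correspond to the loss as written in the YOLOv1 paper. The notation is the paper's, not this article's: hatted symbols are the other side of each prediction/label pair, $\mathbb{1}_{ij}^{obj}$ marks the $j$-th box of cell $i$ being responsible for an object, and the paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
   \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
   \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The lines map one to one onto items 1 through 5: center error, width/height error, confidence of responsible boxes, confidence of non-responsible boxes, and classification error.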

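1.6 YOLOv1 network

The YOLO network borrows from the GoogLeNet classification architecture and has 24 convolutional layers followed by 2 fully connected layers. The convolutional layers are pretrained on ImageNet with 224 * 224 input. For detection the input resolution is doubled to 448 * 448, which works because convolutional layers do not depend on the input size; the detection-specific layers are added after pretraining.

Convolution and pooling arithmetic
W: input size. F: kernel size. P: padding. S: stride.
Convolution output size: floor((W - F + 2P) / S) + 1
Pooling output size: floor((W - F) / S) + 1
As a rule of thumb, to keep the size unchanged at stride 1:
F = 3 -> P = 1
F = 5 -> P = 2
F = 7 -> P = 3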
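
The two formulas are easy to check numerically. The helper names `conv_out` and `pool_out` below are hypothetical; integer division plays the role of the floor, which matters for the first YOLOv1 layer, where (448 - 7 + 6) / 2 is not an integer.

```python
def conv_out(w, f, p, s):
    """Output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1


def pool_out(w, f, s):
    """Output size of a pooling layer: floor((W - F) / S) + 1."""
    return (w - f) // s + 1


# First two blocks of YOLOv1, starting from a 448 x 448 input.
size = conv_out(448, f=7, p=3, s=2)   # 224
size = pool_out(size, f=2, s=2)       # 112
size = conv_out(size, f=3, p=1, s=1)  # 112
size = pool_out(size, f=2, s=2)       # 56
```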

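1. The input is a 448*448*3 image
First convolution: 7*7*64, stride 2, padding 3
img_x = floor((448-7+2*3)/2)+1 = 224, giving 224*224*64
Pooling: 2*2, stride 2
img_x = (224-2)/2+1 = 112, giving 112*112*64

2. Second convolution block
Apply a 3*3*192 convolution to 112*112*64, stride 1, padding 1
img_x = (112-3+2*1)/1+1 = 112, giving 112*112*192
Pooling: 2*2, stride 2
img_x = (112-2)/2+1 = 56, giving 56*56*192

3. Third convolution block
Apply a 1*1*128 convolution to 56*56*192, stride 1
img_x = (56-1+0)/1+1 = 56, giving 56*56*128
Then a 3*3*256 convolution, stride 1, padding 1
img_x = (56-3+2)/1+1 = 56, giving 56*56*256
Then a 1*1*256 convolution
img_x = (56-1+0)/1+1 = 56, giving 56*56*256
Then a 3*3*512 convolution, stride 1, padding 1
img_x = (56-3+2)/1+1 = 56, giving 56*56*512
Pooling: 2*2, stride 2
img_x = (56-2)/2+1 = 28, giving 28*28*512

4. Convolution blocks 4 through 6 follow the same pattern

5. The last two fully connected layers produce the output: 7 * 7 * 30, which checks out.
The fully connected layers flatten the 2-D convolutional feature maps into a 1-D vector, which is then reshaped to 7 * 7 * 30.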

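Because of how YOLOv1 is designed, the network has the following shortcomings:
Each grid cell has only two bounding boxes, so objects with unusual aspect ratios (ones the training set does not cover) are handled poorly.

The image is divided into only 7x7 grid cells, so performance is poor when two objects are very close together.
Each grid cell ends up with a single class, so missed detections (objects that are not recognized) are common.
Small objects are detected poorly. This is a weakness shared by object detection algorithms in general.

1.7 Why use 1*1 convolutions?

They reduce or increase the number of channels, and they add non-linearity through the activation function that follows, which lets the network be made deeper.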

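2. YOLOv2

YOLOv2 makes the following changes on top of YOLOv1 to improve accuracy: mAP on the VOC2007 dataset rises from 63.4% to 78.6%.

2.1 Adding BN layers

Batch normalization standardizes the data to zero mean and unit standard deviation. The reason is that many activation functions are non-saturating near 0; outputs that are too large or too small fall into the saturated region, where training is hard. BN is usually placed before the activation function. Its benefits: faster convergence, better gradients (by staying away from saturation), less sensitivity to initialization, and a regularizing effect that reduces overfitting.

Training: suppose each batch contains 32 images. A given neuron in a given layer then produces 32 responses. Compute their mean and standard deviation and standardize them: (each of the 32 outputs - their mean) / their standard deviation.
The standardized response is then multiplied by gamma and shifted by beta (each neuron learns its own gamma and beta). Before this learned scale and shift, the neuron's output follows a distribution with mean 0 and standard deviation 1.
Inference:
The mean is the expectation of the batch means seen during training, and the variance is an unbiased estimate built from the batch variances seen during training.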
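
The training and inference paths can be sketched for a single neuron. This is a minimal illustration under stated assumptions: the class name `BatchNormUnit` is hypothetical, and the running statistics are tracked with an exponential moving average (momentum 0.1), which is one common way to approximate the expectation over batches described above.

```python
import math


class BatchNormUnit:
    """Batch normalization for one neuron (one scalar feature)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # learned scale
        self.beta = 0.0           # learned shift
        self.running_mean = 0.0   # statistics used at inference
        self.running_var = 1.0
        self.momentum = momentum  # assumption: exponential moving average
        self.eps = eps

    def forward_train(self, xs):
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n          # biased batch variance
        unbiased_var = var * n / (n - 1) if n > 1 else var  # unbiased estimate
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mean
        self.running_var = (1 - m) * self.running_var + m * unbiased_var
        std = math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) / std + self.beta for x in xs]

    def forward_eval(self, xs):
        # Inference uses the accumulated statistics, not the current batch.
        std = math.sqrt(self.running_var + self.eps)
        return [self.gamma * (x - self.running_mean) / std + self.beta for x in xs]


bn = BatchNormUnit()
train_out = bn.forward_train([1.0, 2.0, 3.0, 4.0])  # zero mean, unit variance
eval_out = bn.forward_eval([2.5])
```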


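2.2 High-resolution classifier (classification only, not object detection)

Before:
The YOLOv1 feature extractor is trained on ImageNet at the original 224 * 224 resolution.
The detection network is then trained further on 448 * 448 detection images.

After:
The ImageNet images are first resized from 224 * 224 to 448 * 448.
The YOLOv2 backbone is then trained on this resized 448 * 448 image set.
Finally the backbone is fine-tuned and the detector trained on 448 * 448 detection images.
This change improves mAP by about 3.7 points (65.8 to 69.5 in the paper's ablation table).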

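2.3 Anchor mechanism

YOLOv2 predicts with anchor boxes: each predicted box only has to predict its offset relative to an anchor. The final fully connected layers are removed, so the network consists only of convolutional and pooling layers.
In YOLOv1 the input image ends up divided into 7x7 grid cells, each predicting 2 bounding boxes. YOLOv1 predicts the boxes directly with fully connected layers, with width and height expressed relative to the whole image. Because objects come in many scales and aspect ratios, it is hard for YOLOv1 to learn to adapt to all of these shapes during training, which is why its localization is relatively poor. YOLOv2 introduces anchor boxes in order to reach a higher recall: YOLOv1 predicts only 98 boxes per image, while with anchor boxes the number can exceed a thousand (the implementation in the paper gives 13 x 13 x 5 = 845). Removing the fully connected layers also preserves some spatial structure. The input changes from 448x448 to 416x416; with 32x downsampling the output is 13x13x5x25. An odd number of grid cells is used because large objects tend to occupy the center of the image, and it is preferable for a single center cell to predict them rather than four cells sharing the center. As a result mAP drops from 69.5 to 69.2 (down 0.3) while recall rises from 81% to 88% (up 7 points). Despite the drop in mAP, the higher recall means the model has more room to improve.

YOLOv2 divides the image into 13*13 grid cells and each cell has 5 anchors (5 prior boxes with different aspect ratios, fixed in advance). Each anchor corresponds to one predicted box, which only predicts its offset relative to the anchor. As before, the one with the largest IOU against the ground-truth box is responsible for predicting the object.

Output: 13 * 13 * 5 (anchors) * 25 (5 + 20), so each grid cell produces 125 numbers.
Input image -> output tensor (the results are contained in the tensor).
YOLOv2 uses Darknet-19 as its backbone, and the anchor shapes can be chosen by clustering the training boxes.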
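
Anchor selection by clustering can be sketched as k-means over ground-truth (width, height) pairs with the distance d = 1 - IOU, as proposed in the YOLOv2 paper. The sketch below is a simplified illustration: the function names, the toy box list, and the fixed random seed are assumptions, and real pipelines typically run this on the whole training set.

```python
import random


def wh_iou(box, centroid):
    """IOU of two boxes aligned at the same corner, so only (w, h) matter."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors using d = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IOU is the same as largest IOU.
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append((sum(w for w, _ in members) / len(members),
                                      sum(h for _, h in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


# Toy data: three small boxes and three large ones.
boxes = [(10, 12), (11, 13), (9, 11), (50, 60), (52, 58), (48, 62)]
anchors = kmeans_anchors(boxes, k=2)
```

Using IOU rather than Euclidean distance keeps large boxes from dominating the clustering error.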

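2.4 About anchor boxes: how positions are determined

In YOLOv1 each grid cell has two bounding boxes, and the center of a box is given by its offset from the grid cell. Because that offset is unconstrained it can be large, so the center is not guaranteed to stay inside the cell: a box that belongs to grid cell 1 may have its center in grid cell 2. YOLOv2 therefore applies a sigmoid to the offset, which confines the center of the box to the grid cell it belongs to.
In other words, in YOLOv1 the grid cell is only a coordinate reference for the bounding box, whereas in YOLOv2 it becomes a constraint on the position. This makes the early stage of training more stable.

Width and height: the width and height of the bounding box are predicted relative to the width and height of the anchor, not relative to the origin and not relative to the grid cell. The raw outputs are passed through an exponential, so the predicted size is always positive and acts as a scale factor on the anchor.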
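
The constrained prediction can be written out using the formulas from the YOLOv2 paper. The symbols are the paper's: $(t_x, t_y, t_w, t_h, t_o)$ are the raw network outputs, $(c_x, c_y)$ is the top-left corner of the grid cell, $(p_w, p_h)$ is the anchor size, and $\sigma$ is the sigmoid.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\Pr(\text{object}) \cdot \mathrm{IOU}(b, \text{object}) &= \sigma(t_o)
\end{aligned}
```

Since $\sigma(\cdot) \in (0, 1)$, the center $(b_x, b_y)$ can never leave its grid cell, and $t_w = t_h = 0$ reproduces the anchor exactly.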

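2.5 Fine-grained features

The goal is to capture finer-grained features.
YOLOv2 introduces a passthrough layer to make use of a higher-resolution feature map.

The receptive field of a point on a feature map is the size of the region of the original image that the point can see.
The closer a layer is to the input, the smaller the receptive field of each point: more local information, less global information.
The farther a layer is from the input, the larger the receptive field of each point: less local information, more global information.
YOLOv1 localizes and classifies using only the final feature map, so it handles global information well but tends to lose local detail. As a result it is weak on small objects and often misses them. To overcome this, YOLOv2 adds fine-grained feature extraction on top of the Darknet network, implemented through the passthrough layer. Localization and classification then take two inputs:

High-level, coarse-grained semantic features (suited to large objects)
Low-level, fine-grained pixel features (suited to small objects)

This design handles both large and small objects.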
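
The passthrough layer is a space-to-depth rearrangement: in the YOLOv2 paper it turns the 26 x 26 x 512 feature map into 13 x 13 x 2048 so that it can be concatenated with the final 13 x 13 map. The sketch below shows the idea on nested lists in (H, W, C) layout. The function name is hypothetical, and the channel ordering is an assumption that may differ from Darknet's actual reorg layer.

```python
def passthrough(feature, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    h, w = len(feature), len(feature[0])
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            stacked = []
            # Stack the stride x stride neighbourhood along the channel axis.
            for di in range(stride):
                for dj in range(stride):
                    stacked.extend(feature[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out


# A 4 x 4 map with 1 channel becomes a 2 x 2 map with 4 channels.
feature = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = passthrough(feature)
```

No values are discarded, unlike pooling: spatial resolution is traded for channels, so the fine-grained information survives the size reduction.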


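2.6 Multi-scale input

With no fully connected layers, only convolutional and pooling layers, YOLOv2 accepts inputs of other sizes in addition to the standard 416 * 416 * 3. To make the model more robust, YOLOv2 trains with multi-scale input: the input image size is changed every fixed number of iterations. Because the total downsampling stride of YOLOv2 is 32, the sizes are chosen from multiples of 32: {320, 352, 384, ..., 608}. The smallest input, 320x320, gives a 10x10 feature map (no longer odd, which is admittedly a little awkward), and the largest, 608x608, gives a 19x19 feature map. During training a new input size is picked at random every 10 iterations; only the handling of the final detection layer has to change before training continues. With multi-scale training, YOLOv2 adapts to images of different sizes and predicts well on all of them.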
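
The size schedule can be sketched in a few lines. The function names and the fixed seed are hypothetical; the size range, the step of 32, and the interval of 10 iterations follow the description above.

```python
import random


def pick_input_size(rng, min_size=320, max_size=608, stride=32):
    """Draw one input size from {320, 352, ..., 608}."""
    return rng.choice(range(min_size, max_size + 1, stride))


def training_sizes(num_iters, every=10, seed=0):
    """Input size used at each iteration, re-drawn every `every` iterations."""
    rng = random.Random(seed)
    schedule = []
    size = None
    for it in range(num_iters):
        if it % every == 0:
            size = pick_input_size(rng)
        schedule.append(size)
    return schedule


sizes = training_sizes(30)
feature_maps = [s // 32 for s in sizes]  # between 10 and 19
```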

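Why does a fully connected layer need a fixed input size?
The main reason is that the dimensions of its weight matrix must be fixed in advance. Every neuron in a fully connected layer is connected to every neuron in the previous layer, so there is one weight per input element. If the input size changed, the flattened input would have a different length and the existing weight matrix would no longer fit: each input size would need its own weight matrix. That adds computation, prevents the weights from being shared, and hurts both efficiency and generalization.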

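2.7 Loss function

The loss iterates over all predicted boxes.
Term 1 (background): applies when the IOU between this anchor's box and the ground-truth boxes is below 0.6. The box is treated as background and its confidence is pushed toward 0.
Term 2 (prior matching): applies only during the first 12800 iterations. Early in training the predicted boxes are pulled toward their anchors, so the model quickly learns to predict the anchor position and the coordinates become more stable.
Term 3 (responsible for an object): applies when this anchor is responsible for detecting an object.
Classification error: the squared difference between the class of the ground-truth box and the class of the predicted box. With 20 classes, the ground-truth box yields a 20-dimensional vector with a single 1 and 0 everywhere else, and the predicted box yields a 20-dimensional vector of class probabilities. The two are subtracted element by element, squared, and summed.

Three kinds of anchors:
1. The one with the largest IOU against the ground-truth box: responsible for predicting the object.
2. IOU against the ground-truth box below 0.6: discarded as background.
3. IOU against the ground-truth box above 0.6 but not the largest: their loss is ignored.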

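2.8 Why BN and dropout should not be used together

Dropout: a strategy for preventing overfitting during neural network training.
During training, neurons (except those in the output layer) are randomly removed according to a given probability (typically a sampling probability of 0.5 for hidden layers and 0.8 for the input layer).

A BN layer computes the mean and variance of each batch and uses these statistics to normalize the data. Dropout randomly drops some neurons and so changes the distribution of the data. The BN layer can then no longer estimate the statistics correctly, which degrades its normalization.

What is the difference between a bounding box and an anchor box?
A bounding box is a rectangle that describes the position and extent of a target in the image. It is defined by the coordinates of its top-left and bottom-right corners and is used to mark and localize the target. In object detection, the model detects and localizes objects by predicting their bounding boxes.

An anchor box is a predefined bounding box used in object detection. Anchors are generated on the input image according to fixed rules, usually as a set of rectangles with different sizes and aspect ratios. They serve as candidate boxes from which the model predicts the position and class of objects. During training, box regression and classification are driven by how well each anchor matches the actual targets.

During training the model predicts and adjusts the position and size of bounding boxes by matching against anchor boxes.
In short, a bounding box is the actual box around a target object, while an anchor box is a predefined box used to generate candidates and guide the model's bounding-box predictions.

2.8 YOLOv2 network

YOLOv2 uses the Darknet-19 network, which has 19 convolutional layers and 5 max-pooling layers. This is slimmer than YOLOv1's 24 convolutional layers plus 2 fully connected layers.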

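3. YOLOv3 summary

3.1 Network

YOLOv3 uses Darknet-53 as the backbone for feature extraction. The network relies on residual connections (the ResNet structure), which mitigate vanishing gradients in deep networks and make training more stable. Residual connections also make it possible to build deeper networks, which improves feature extraction.

A residual stage in Darknet-53 starts with a 3x3 convolution with stride 2, which compresses the width and height of the incoming feature map. Call the resulting feature map layer. A 1x1 convolution and a 3x3 convolution are then applied to it, and the result is added back to layer. This forms the residual structure.

Code reference: 睿智的目标检测26, "Pytorch搭建yolo3目标检测平台" (building a YOLOv3 object detection platform in PyTorch)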

import math
from collections import OrderedDict

import torch.nn as nn


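#---------------------------------------------------------------------#
#   Residual block
#   A 1x1 convolution reduces the channel count, then a 3x3 convolution
#   extracts features and restores the channel count.
#   Finally the input is added back through the residual (skip) edge.
#---------------------------------------------------------------------#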
class BasicBlock(nn.Module):
    def __init__(self, inplanes, planes):
        super(BasicBlock, self).__init__()
        self.conv1  = nn.Conv2d(inplanes, planes[0], kernel_size=1, stride=1, padding=0, bias=False)
        self.bn1    = nn.BatchNorm2d(planes[0])
        self.relu1  = nn.LeakyReLU(0.1)
        
        self.conv2  = nn.Conv2d(planes[0], planes[1], kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2    = nn.BatchNorm2d(planes[1])
        self.relu2  = nn.LeakyReLU(0.1)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)

        out += residual
        return out

class DarkNet(nn.Module):
    def __init__(self, layers):
        super(DarkNet, self).__init__()
        self.inplanes = 32
        # 416,416,3 -> 416,416,32
        self.conv1  = nn.Conv2d(3, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1    = nn.BatchNorm2d(self.inplanes)
        self.relu1  = nn.LeakyReLU(0.1)

        # 416,416,32 -> 208,208,64
        self.layer1 = self._make_layer([32, 64], layers[0])
        # 208,208,64 -> 104,104,128
        self.layer2 = self._make_layer([64, 128], layers[1])
        # 104,104,128 -> 52,52,256
        self.layer3 = self._make_layer([128, 256], layers[2])
        # 52,52,256 -> 26,26,512
        self.layer4 = self._make_layer([256, 512], layers[3])
        # 26,26,512 -> 13,13,1024
        self.layer5 = self._make_layer([512, 1024], layers[4])

        self.layers_out_filters = [64, 128, 256, 512, 1024]

        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    #---------------------------------------------------------------------#
    #   Each stage first downsamples with a stride-2 3x3 convolution,
    #   then stacks residual blocks.
    #---------------------------------------------------------------------#
    def _make_layer(self, planes, blocks):
        layers = []
        # Downsample: stride 2, kernel size 3
        layers.append(("ds_conv", nn.Conv2d(self.inplanes, planes[1], kernel_size=3, stride=2, padding=1, bias=False)))
        layers.append(("ds_bn", nn.BatchNorm2d(planes[1])))
        layers.append(("ds_relu", nn.LeakyReLU(0.1)))
        # Append the residual blocks
        self.inplanes = planes[1]
        for i in range(0, blocks):
            layers.append(("residual_{}".format(i), BasicBlock(self.inplanes, planes)))
        return nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.layer1(x)
        x = self.layer2(x)
        out3 = self.layer3(x)
        out4 = self.layer4(out3)
        out5 = self.layer5(out4)

        return out3, out4, out5

def darknet53():
    model = DarkNet([1, 2, 8, 8, 4])
    return model

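3.2 Getting results from the feature layers

Getting predictions from the features has two parts:
Build an FPN feature pyramid for enhanced feature extraction.
Use the YOLO head to predict on the three effective feature layers.

For feature use, YOLOv3 detects on multiple feature layers, three in total.
They sit at different depths of the Darknet-53 backbone (middle, lower-middle, and bottom), with shapes (52,52,256), (26,26,512), and (13,13,1024). After processing, the YOLO head produces the predictions from them.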

from collections import OrderedDict

import torch
import torch.nn as nn

from nets.darknet import darknet53


def conv2d(filter_in, filter_out, kernel_size):
    pad = (kernel_size - 1) // 2 if kernel_size else 0
    return nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(filter_in, filter_out, kernel_size=kernel_size, stride=1, padding=pad, bias=False)),
        ("bn", nn.BatchNorm2d(filter_out)),
        ("relu", nn.LeakyReLU(0.1)),
    ]))


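# ------------------------------------------------------------------------#
#   make_last_layers contains seven convolutions: the first five extract
#   features and the last two produce the YOLO predictions.
#   Arguments (filters_list, in_filters, out_filter) for the three heads:
#   [512, 1024], 1024, 75
#   [256, 512], 768, 75
#   [128, 256], 384, 75
# ------------------------------------------------------------------------#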
def make_last_layers(filters_list, in_filters, out_filter):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        nn.Conv2d(filters_list[1], out_filter, kernel_size=1, stride=1, padding=0, bias=True)
    )
    return m


class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, pretrained=False):
        super(YoloBody, self).__init__()
        # ---------------------------------------------------#
        #   Build the darknet53 backbone.
        #   It yields three effective feature layers with shapes:
        #   52,52,256
        #   26,26,512
        #   13,13,1024
        # ---------------------------------------------------#
        self.backbone = darknet53()
        if pretrained:
            self.backbone.load_state_dict(torch.load("model_data/darknet53_backbone_weights.pth"))

        # ---------------------------------------------------#
        #   out_filters : [64, 128, 256, 512, 1024]
        # ---------------------------------------------------#
        out_filters = self.backbone.layers_out_filters

        # ------------------------------------------------------------------------#
        #   Number of output channels of each yolo_head. For the VOC dataset:
        #   final_out_filter0 = final_out_filter1 = final_out_filter2 = 75
        # ------------------------------------------------------------------------#
        # First output feature layer (13*13*75)
        self.last_layer0 = make_last_layers([512, 1024], out_filters[-1], len(anchors_mask[0]) * (num_classes + 5))

        # Second output feature layer (26*26*75)
        self.last_layer1_conv = conv2d(512, 256, 1)
        self.last_layer1_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer1 = make_last_layers([256, 512], out_filters[-2] + 256, len(anchors_mask[1]) * (num_classes + 5))

        # Third output feature layer (52*52*75)
        self.last_layer2_conv = conv2d(256, 128, 1)
        self.last_layer2_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer2 = make_last_layers([128, 256], out_filters[-3] + 128, len(anchors_mask[2]) * (num_classes + 5))

    def forward(self, x):
        # ---------------------------------------------------#
        #   Get the three effective feature layers, with shapes:
        #   52,52,256; 26,26,512; 13,13,1024
        # ---------------------------------------------------#
        x2, x1, x0 = self.backbone(x)

        # ---------------------------------------------------#
        #   First feature layer
        #   out0 = (batch_size,255,13,13) for COCO (75 channels for VOC)
        # ---------------------------------------------------#
        # 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512
        # Run the first five layers of last_layer0
        out0_branch = self.last_layer0[:5](x0)
        # Output of the first head (the layers of last_layer0 after the fifth)
        out0 = self.last_layer0[5:](out0_branch)

        # 13,13,512 -> 13,13,256 -> 26,26,256
        # Take the saved branch, apply a 1*1 convolution to change the channel count, then upsample
        x1_in = self.last_layer1_conv(out0_branch)
        x1_in = self.last_layer1_upsample(x1_in)

        # 26,26,256 + 26,26,512 -> 26,26,768
        x1_in = torch.cat([x1_in, x1], 1)
        # ---------------------------------------------------#
        #   Second feature layer
        #   out1 = (batch_size,255,26,26)
        # ---------------------------------------------------#
        # 26,26,768 -> 26,26,256 -> 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256
        out1_branch = self.last_layer1[:5](x1_in)
        out1 = self.last_layer1[5:](out1_branch)

        # 26,26,256 -> 26,26,128 -> 52,52,128
        x2_in = self.last_layer2_conv(out1_branch)
        x2_in = self.last_layer2_upsample(x2_in)

        # 52,52,128 + 52,52,256 -> 52,52,384
        x2_in = torch.cat([x2_in, x2], 1)
        # ---------------------------------------------------#
        #   Third feature layer
        #   out2 = (batch_size,255,52,52)
        # ---------------------------------------------------#
        # 52,52,384 -> 52,52,128 -> 52,52,256 -> 52,52,128 -> 52,52,256 -> 52,52,128
        out2 = self.last_layer2(x2_in)
        return out0, out1, out2


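# YoloBody expects anchors_mask: for each output scale, the indices of the 3 anchors it uses.
# Passing the raw anchor sizes (6 numbers per scale) would give 6 * 25 = 150 output channels instead of 75.
anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
if __name__ == '__main__':
    YoloBody(anchors_mask, 20)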

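3.3 Decoding the predictions

From step two we get the predictions of the three feature layers, with shapes:
(N,13,13,255)
(N,26,26,255)
(N,52,52,255)
Here is a brief look at what each effective feature layer does:
Each one divides the image into a grid matching its own height and width; the (N,13,13,255) layer, for example, divides the image into 13x13 cells. Several prior boxes, predefined by the network, are then placed at each grid cell, and the prediction says whether each of these boxes contains an object and what class the object is.
Each grid point has three prior boxes, so the predictions above can be reshaped to:
(N,13,13,3,85)
(N,26,26,3,85)
(N,52,52,3,85)

The 85 splits into 4+1+80: 4 adjustment parameters for the prior box, 1 for whether the prior box contains an object, and 80 class scores. COCO has 80 classes, hence 80. If YOLOv3 detected only two classes, 85 would become 4+1+2 = 7.
That is, the 85 values are x_offset, y_offset, h and w, the confidence, and the classification result.
These predictions do not yet give the position of the final box on the image; they have to be decoded first.

YOLOv3 decoding takes two steps:
First add x_offset and y_offset to each grid point. The result is the center of the predicted box.
Then combine the prior box with h and w to compute the width and height of the predicted box. This gives the full position of the box.

After the final predictions are obtained, score sorting and non-maximum suppression are applied.
This part is common to essentially all object detectors. For each class:
1. Take the boxes and scores whose score exceeds self.obj_threshold.
2. Run non-maximum suppression using the box positions and scores.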

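Code walkthrough
Decoding the predictions:

1. Parse the tensors output by the three feature layers and compute stride_h and stride_w for the current layer.
2. Rescale the anchors to the feature-map scale (anchor width and height divided by the stride).
3. Reshape the output from (batch_size, 255, 13, 13) to (batch_size, 3, 13, 13, 85).
4. Extract x, y, w, h, conf, and cls from the result.
5. Build the 13*13 grid, represented by two tensors of size torch.Size([1, 3, 13, 13]).
6. Expand the rescaled anchor widths and heights into two [1, 3, 13, 13] tensors.
7. Compute the predicted box centers, widths, and heights from the decoding formulas.
8. Divide the coordinates by the feature-map size (13) and output all predicted boxes ([1, 10647, 85] once the three layers are concatenated).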
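
The figure of 10647 boxes follows directly from the three strides. A quick check, assuming a 416 x 416 input and 3 anchors per cell as in the walkthrough (the function name is hypothetical):

```python
def yolo_v3_grid_summary(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Return (stride, grid size, number of boxes) for each output scale."""
    summary = []
    for s in strides:
        grid = input_size // s
        summary.append((s, grid, grid * grid * anchors_per_cell))
    return summary


summary = yolo_v3_grid_summary()
total_boxes = sum(n for _, _, n in summary)  # 507 + 2028 + 8112
```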

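NMS (non-maximum suppression):

1. Convert (center x, center y, w, h) to (left x, top y, right x, bottom y).
2. For each box, take the highest class confidence and the index of that class.
3. Drop everything below the confidence threshold.
4. Concatenate into 7 values per box: x1, y1, x2, y2, obj_conf, class_conf, class_pred
5. Run non-maximum suppression
6. Convert the results back to center coordinates (x, y) plus width and height
7. Multiply the results by the original image_shape to get the final result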

import torch
import torch.nn as nn
from torchvision.ops import nms
import numpy as np

"""
Configurations:
----------------------------------------------------------------------
|                     keys |                                   values|
----------------------------------------------------------------------
|               model_path |              model_data/yolo_weights.pth|
|             classes_path |              model_data/coco_classes.txt|
|             anchors_path |              model_data/yolo_anchors.txt|
|             anchors_mask |        [[6, 7, 8], [3, 4, 5], [0, 1, 2]]|
|              input_shape |                               [416, 416]|
|               confidence |                                      0.5|
|                  nms_iou |                                      0.3|
|          letterbox_image |                                    False|
|                     cuda |                                     True
"""


class DecodeBox():
    def __init__(self, anchors, num_classes, input_shape, anchors_mask=[[6, 7, 8], [3, 4, 5], [0, 1, 2]]):
        super(DecodeBox, self).__init__()
        self.anchors = anchors  # anchor (width, height) pairs
        self.num_classes = num_classes  # number of classes
        self.bbox_attrs = 5 + num_classes  # values per anchor: 5 + 80 for COCO
        self.input_shape = input_shape  # network input size [416,416]
        # -----------------------------------------------------------#
        #   The 13x13 feature layer uses anchors [116,90],[156,198],[373,326]
        #   The 26x26 feature layer uses anchors [30,61],[62,45],[59,119]
        #   The 52x52 feature layer uses anchors [10,13],[16,30],[33,23]
        # -----------------------------------------------------------#

        # [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
        self.anchors_mask = anchors_mask
        print("__init__anchors_mask:", anchors_mask)

    def decode_box(self, inputs):
        outputs = []
        for i, input in enumerate(inputs):
            # -----------------------------------------------#
            #   There are three inputs, with shapes
            #   batch_size, 255, 13, 13
            #   batch_size, 255, 26, 26
            #   batch_size, 255, 52, 52
            # -----------------------------------------------#
            batch_size = input.size(0)
            input_height = input.size(2)
            input_width = input.size(3)

            # 1 13 13
            print("input:", batch_size, input_height, input_width)
            # -----------------------------------------------#
            #   For a 416x416 input
            #   stride_h = stride_w = 32, 16, 8
            # -----------------------------------------------#
            stride_h = self.input_shape[0] / input_height
            stride_w = self.input_shape[1] / input_width
            # 32.0 32.0
            print("stride_w:stride_h:", stride_w, stride_h)

            # -------------------------------------------------#
            #   scaled_anchors are expressed in feature-map units
            # -------------------------------------------------#
            scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in
                              self.anchors[self.anchors_mask[i]]]

            # scaled_anchors: [(3.625, 2.8125), (4.875, 6.1875), (11.65625, 10.1875)]
            print("scaled_anchors:", scaled_anchors)

            # -----------------------------------------------#
            #   After reshaping, the three predictions have shapes
            #   batch_size, 3, 13, 13, 85
            #   batch_size, 3, 26, 26, 85
            #   batch_size, 3, 52, 52, 85
            # -----------------------------------------------#
            # Reshape bs, 3*(5+num_classes), 13, 13 -> bs, 3, 13, 13, (5+num_classes)
            prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                    self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

            # -----------------------------------------------#
            #   Adjustment parameters for the prior box center
            #   "..." keeps all leading dimensions
            #   [..., 0] selects element 0 of the last dimension: x
            # -----------------------------------------------#
            x = torch.sigmoid(prediction[..., 0])
            y = torch.sigmoid(prediction[..., 1])
            # torch.Size([1, 3, 13, 13])
            # print("x:,y:", x, y, x.shape, y.shape)
            # -----------------------------------------------#
            #   Adjustment parameters for the prior box width and height
            # -----------------------------------------------#
            w = prediction[..., 2]
            h = prediction[..., 3]

            # torch.Size([1, 3, 13, 13])
            # print("w,h:", w, h, w.shape, h.shape)
            # -----------------------------------------------#
            #   Objectness confidence: is there an object?
            # -----------------------------------------------#
            conf = torch.sigmoid(prediction[..., 4])
            # -----------------------------------------------#
            #   Confidence for every class
            # -----------------------------------------------#
            pred_cls = torch.sigmoid(prediction[..., 5:])

            # Pick the CPU or CUDA tensor type depending on where x lives (32-bit float)
            FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
            # Pick the CPU or CUDA tensor type depending on where x lives (64-bit integer)
            LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

            # ----------------------------------------------------------#
            #   Build the grid: each prior box is anchored at the
            #   top-left corner of its grid cell
            #   batch_size,3,13,13
            # ----------------------------------------------------------#
            # linspace creates an evenly spaced 1-D tensor,
            # here 0 -> 12 with length 13.
            # repeat(input_height, 1) stacks it into an input_height * input_width matrix;
            # the second repeat makes batch_size * len(self.anchors_mask[i]) copies (a 3-D tensor),
            # and view(x.shape) gives torch.Size([1, 3, 13, 13]). Every row reads 0, 1, 2, ..., 12.
            grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)

            # torch.Size([1, 3, 13, 13]) =>
            # 0   (every column looks like this)
            # 1
            # 2
            # ...
            # 12
            grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

            # ----------------------------------------------------------#
            #   Lay out the prior box widths and heights on the grid
            #   batch_size,3,13,13
            # ----------------------------------------------------------#
            # Select column 0 of the 2-D scaled_anchors tensor: the anchor widths
            anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
            # Select column 1 of the 2-D scaled_anchors tensor: the anchor heights
            anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
            # anchor_w: anchor_h tensor([[3.6250],[4.8750],[11.6562]], device='cuda:0') tensor([[2.8125],[6.1875],[10.1875]], device='cuda:0')
            # scaled_anchors: [(3.625, 2.8125), (4.875, 6.1875), (11.65625, 10.1875)]
            print("anchor_w:anchor_h", anchor_w, anchor_h)

            # Repeat anchor_w batch_size times along dim 0, then input_height * input_width times along the last dim.
            # The result has shape (1, batch_size * 3, input_height * input_width): the first 169 values are 3.625, the next 4.8750, the last 11.6562
            # view() then reshapes [1, 3, 169] to ([1, 3, 13, 13])
            anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
            anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)
            # ([1, 3, 13, 13])
            print("anchor_w:anchor_h", anchor_w, anchor_h)
            # ----------------------------------------------------------#
            #   Adjust the prior boxes with the predictions.
            #   First shift the center: from the cell's top-left corner
            #   toward the bottom right.
            #   Then adjust the width and height.
            # ----------------------------------------------------------#
            # Box coordinate regression
            # sigmoid(x) + grid_x
            # anchor_w * exp(w)

            # [1,3,13,13,4]
            pred_boxes = FloatTensor(prediction[..., :4].shape)
            pred_boxes[..., 0] = x.data + grid_x
            pred_boxes[..., 1] = y.data + grid_y
            pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
            pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

            # ----------------------------------------------------------#
            #   Normalize the output to fractions in the 0-1 range
            # ----------------------------------------------------------#
            # tensor([13., 13., 13., 13.], device='cuda:0')
            # torch.Size([4])
            _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)

            # torch.Size([1, 507, 85])  => 13*13*3
            # torch.Size([1, 2028, 85]) => 26*26*3
            # torch.Size([1, 8112, 85]) => 52*52*3

            # pred_boxes.view(batch_size, -1, 4)
            # Dividing by _scale converts feature-map coordinates to normalized (0-1) coordinates
            output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale,
                                conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
            outputs.append(output.data)
        return outputs

    def yolo_correct_boxes(self, box_xy, box_wh, input_shape, image_shape, letterbox_image):
        # box_xy, box_wh: [12,2], center coordinates and width/height
        # input_shape [416,416]
        # image_shape [1300,1300]
        # letterbox_image false
        # -----------------------------------------------------------------#
        #   y is placed first so that the boxes can be multiplied
        #   directly by the image (height, width)
        # -----------------------------------------------------------------#
        # ::-1 reverses the last axis (reads it back to front)
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        input_shape = np.array(input_shape)
        image_shape = np.array(image_shape)

        if letterbox_image:
            # -----------------------------------------------------------------#
            #   offset is the shift of the valid image region relative to
            #   the top-left corner of the padded image
            #   new_shape is the image size after scaling
            # -----------------------------------------------------------------#
            new_shape = np.round(image_shape * np.min(input_shape / image_shape))
            offset = (input_shape - new_shape) / 2. / input_shape
            scale = input_shape / new_shape

            box_yx = (box_yx - offset) * scale
            box_hw *= scale
        # Top-left corner = center - half of the size [12,2]
        box_mins = box_yx - (box_hw / 2.)
        # Bottom-right corner box_maxes [12,2]
        box_maxes = box_yx + (box_hw / 2.)

        # box_mins  => (y_min, x_min)
        # box_maxes => (y_max, x_max)
        # Concatenate the four slices along the last axis into a new array boxes [12,4]
        boxes = np.concatenate([box_mins[..., 0:1], box_mins[..., 1:2], box_maxes[..., 0:1], box_maxes[..., 1:2]],
                               axis=-1)
        print("boxes aaa:", boxes, boxes.shape)
        # Multiply by the image size to turn relative coordinates into absolute ones. [1300 1300 1300 1300] (4,)
        boxes *= np.concatenate([image_shape, image_shape], axis=-1)
        print("boxes bbb:", boxes, boxes.shape)
        return boxes

    def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5,
                            nms_thres=0.4):

        # torch.Size([1, 507, 85])  => 13*13*3
        # torch.Size([1, 2028, 85]) => 26*26*3
        # torch.Size([1, 8112, 85]) => 52*52*3
        # prediction => ([1, 10647, 85])
        # num_classes = 80
        # input_shape => 416, 416
        # image_shape => 1300*1300
        # letterbox_image => false
        # conf_thres = 0.5, nms_thres = 0.4

        # ----------------------------------------------------------#
        #   Convert the predictions to top-left / bottom-right format.
        #   prediction  [batch_size, num_anchors, 85]
        # ----------------------------------------------------------#
        # Allocate a new tensor of shape ([1, 10647, 85])
        box_corner = prediction.new(prediction.shape)
        # Convert (center x, center y, w, h) to (left x, top y, right x, bottom y)
        box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
        box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
        box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
        box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
        prediction[:, :, :4] = box_corner[:, :, :4]

        output = [None for _ in range(len(prediction))]
        print(len(prediction))
        for i, image_pred in enumerate(prediction):
            # ----------------------------------------------------------#
            #   Take the max over the class predictions.
            #   class_conf  [num_anchors, 1]    class confidence (10647,1)
            #   class_pred  [num_anchors, 1]    index of the most confident class (10647,1)
            #   image_pred => (10647,85)
            # ----------------------------------------------------------#
            class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

            # ----------------------------------------------------------#
            #   First round of filtering by confidence.
            #   The mask is True where objectness (image_pred[:, 4]) times
            #   the highest class confidence is at least conf_thres.
            # ----------------------------------------------------------#
            conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()

            # ----------------------------------------------------------#
            #   Filter the predictions by confidence
            # ----------------------------------------------------------#
            # Keep only the boxes, class confidences, and class indices where the mask is True
            image_pred = image_pred[conf_mask]
            class_conf = class_conf[conf_mask]
            class_pred = class_pred[conf_mask]
            # (33,85)
            if not image_pred.size(0):
                continue
            # -------------------------------------------------------------------------#
            #   detections  [num_anchors, 7]
            #   The 7 values are: x1, y1, x2, y2, obj_conf, class_conf, class_pred
            # -------------------------------------------------------------------------#
            # (33,7)
            detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)

            # ------------------------------------------#
            #   Collect every class present in the predictions
            # ------------------------------------------#
            # Extract the unique class labels, e.g. [0,1,2]
            unique_labels = detections[:, -1].cpu().unique()

            # Are the predictions on the GPU?
            if prediction.is_cuda:
                # Move the unique labels to the GPU
                unique_labels = unique_labels.cuda()
                # Move the detections from CPU to GPU
                detections = detections.cuda()

            for c in unique_labels:
                # ------------------------------------------#
                #   All predictions of class c that survived the score filter
                # ------------------------------------------#
                # e.g. (22, 7) for c = 0
                detections_class = detections[detections[:, -1] == c]

                # ------------------------------------------#
                #   The library-provided nms is faster than the
                #   hand-written loop kept below for reference.
                # ------------------------------------------#
                keep = nms(
                    detections_class[:, :4],
                    detections_class[:, 4] * detections_class[:, 5],
                    nms_thres
                )
                # e.g. keep = [12, 14, 8, 6, 21, 19, 20, 18], 8 indices
                # max_detections => (8, 7)
                max_detections = detections_class[keep]

                # Hand-written alternative, kept for reference.
                # Sort by objectness * class score in descending order;
                # conf_sort_index has shape (22,)
                #_, conf_sort_index = torch.sort(detections_class[:, 4]*detections_class[:, 5], descending=True)
                # Reorder the detections with the sorted indices, (22, 7)
                #detections_class = detections_class[conf_sort_index]

                # Run non-maximum suppression
                #max_detections = []
                # size(0) is the number of boxes still left
                #while detections_class.size(0):
                    # Take the highest-scoring box of this class, then drop every remaining box whose overlap with it reaches nms_thres
                    # e.g. tensor([0.0586, 0.4037, 0.2033, 0.7166, 0.9997, 1.0000, 0.0000]): row 0 of the (22, 7) tensor, shape (7,)
                #    print("dete:",detections_class[0])
                    # (1, 7)
                #    sss = detections_class[0].unsqueeze(0)
                #    max_detections.append(sss)
                    # 22
                #    print(len(detections_class))
                #    if len(detections_class) == 1:
                #        break
                    # IoU of the current best box against every remaining box (second to last)
                #    ious = bbox_iou(max_detections[-1], detections_class[1:])
                    # Keep only the boxes whose IoU with the best box is below nms_thres
                #    detections_class = detections_class[1:][ious < nms_thres]
                # Stack the kept boxes
                #max_detections = torch.cat(max_detections).data

                # Add max detections to outputs
                # (12,7)
                output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))

            if output[i] is not None:
                # Move output[i] from the GPU to the CPU and convert it to a NumPy array
                output[i] = output[i].cpu().numpy()
                # (12, 2) each: box centres (x, y) and box sizes (w, h)
                box_xy, box_wh = (output[i][:, 0:2] + output[i][:, 2:4]) / 2, output[i][:, 2:4] - output[i][:, 0:2]
                output[i][:, :4] = self.yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
        return output
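
The hand-written loop that is commented out above is the classic greedy NMS. As a minimal sketch of that idea, written in plain Python rather than PyTorch, the following keeps the highest-scoring box and drops every remaining box whose IoU with it exceeds the threshold. The helper names `iou_xyxy` and `greedy_nms` are illustrative and do not appear in the original code.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thres):
    """Return the indices of the kept boxes, highest score first."""
    # Candidate indices sorted by descending score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring box still left
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou_xyxy(boxes[best], boxes[i]) <= iou_thres]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_thres=0.5)
print(kept)
```

The first two boxes overlap heavily, so only the higher-scoring one survives, while the third box is far away and is kept. In the function above this procedure is applied once per class, which is why the detections are first split by `unique_labels`.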

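3.4、Loss computation

Computing the loss means comparing the prediction with the ground truth:
the prediction is the output of the network;
the ground truth is the set of labelled boxes.

1、Locate each ground-truth box in the image and decide which grid cell is responsible for detecting it. Then find which prior box (anchor) of that cell overlaps the ground-truth box most. The anchor with the highest overlap is used as the positive sample, and the target is the output the network would have to produce at that cell to reproduce the ground-truth box.

2、Decode the network output into predicted boxes and compute the overlap of each predicted box with all ground-truth boxes. If the overlap exceeds a threshold, the anchor behind that predicted box is ignored. All remaining anchors are negative samples.

3、The final loss has three parts: a. for positive samples, the difference between the encoded width/height and x/y offsets and the predicted values; b. for positive samples, the predicted confidence compared with 1, and for negative samples, the predicted confidence compared with 0; c. for boxes that actually contain an object, the predicted class compared with the true class.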
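
Step 1 and part a of the loss rely on encoding a ground-truth box into the values the network is expected to output. The following plain-Python sketch mirrors the non-GIoU branch of `get_target` in the code below; the function name `encode_target` and the example numbers are assumptions made for illustration.

```python
import math


def encode_target(box, anchor_wh, grid_w, grid_h):
    """Encode one ground-truth box for one anchor.

    box       : (cx, cy, w, h), normalized to [0, 1] relative to the image
    anchor_wh : (w, h) of the matched anchor, in feature-map cells
    returns   : ((i, j), (tx, ty, tw, th))
    """
    # Map the normalized box onto the feature map
    cx, cy = box[0] * grid_w, box[1] * grid_h
    w, h = box[2] * grid_w, box[3] * grid_h
    # The cell that contains the centre is responsible for the box
    i, j = int(math.floor(cx)), int(math.floor(cy))
    # Centre offsets inside the cell, compared with sigmoid(x), sigmoid(y)
    tx, ty = cx - i, cy - j
    # Log-ratio to the anchor size, compared with the raw w, h outputs
    tw = math.log(w / anchor_wh[0])
    th = math.log(h / anchor_wh[1])
    return (i, j), (tx, ty, tw, th)


# 416x416 input, 13x13 feature map (stride 32); anchor (116, 90) in pixels
anchor = (116 / 32, 90 / 32)
cell, t = encode_target((0.5, 0.5, 116 / 416, 90 / 416), anchor, 13, 13)
print(cell, t)
```

A box centred in the image with exactly the anchor's size lands in cell (6, 6) with offsets of 0.5 and width/height targets of about 0, since log(1) = 0.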

import math
from functools import partial

import numpy as np
import torch
import torch.nn as nn


class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, input_shape, cuda, anchors_mask=[[6, 7, 8], [3, 4, 5], [0, 1, 2]]):
        super(YOLOLoss, self).__init__()
        # -----------------------------------------------------------#
        #   Anchors of the 13x13 feature map: [116,90],[156,198],[373,326]
        #   Anchors of the 26x26 feature map: [30,61],[62,45],[59,119]
        #   Anchors of the 52x52 feature map: [10,13],[16,30],[33,23]
        # -----------------------------------------------------------#
        self.anchors = anchors
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.input_shape = input_shape
        self.anchors_mask = anchors_mask

        self.giou = True
        self.balance = [0.4, 1.0, 4]
        self.box_ratio = 0.05
        self.obj_ratio = 5 * (input_shape[0] * input_shape[1]) / (416 ** 2)
        self.cls_ratio = 1 * (num_classes / 80)

        self.ignore_threshold = 0.5
        self.cuda = cuda

    def clip_by_tensor(self, t, t_min, t_max):
        t = t.float()
        result = (t >= t_min).float() * t + (t < t_min).float() * t_min
        result = (result <= t_max).float() * result + (result > t_max).float() * t_max
        return result

    def MSELoss(self, pred, target):
        return torch.pow(pred - target, 2)

    def BCELoss(self, pred, target):
        epsilon = 1e-7
        pred = self.clip_by_tensor(pred, epsilon, 1.0 - epsilon)
        output = - target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
        return output

    def box_giou(self, b1, b2):
        """
        Inputs
        ----------
        b1: tensor, shape=(batch, anchor_num, feat_h, feat_w, 4), xywh
        b2: tensor, shape=(batch, anchor_num, feat_h, feat_w, 4), xywh

        Returns
        -------
        giou: tensor, shape=(batch, anchor_num, feat_h, feat_w)
        """
        # ----------------------------------------------------#
        #   Top-left and bottom-right corners of the predicted boxes
        # ----------------------------------------------------#
        b1_xy = b1[..., :2]
        b1_wh = b1[..., 2:4]
        b1_wh_half = b1_wh / 2.
        b1_mins = b1_xy - b1_wh_half
        b1_maxes = b1_xy + b1_wh_half
        # ----------------------------------------------------#
        #   Top-left and bottom-right corners of the ground-truth boxes
        # ----------------------------------------------------#
        b2_xy = b2[..., :2]
        b2_wh = b2[..., 2:4]
        b2_wh_half = b2_wh / 2.
        b2_mins = b2_xy - b2_wh_half
        b2_maxes = b2_xy + b2_wh_half

        # ----------------------------------------------------#
        #   IoU between every ground-truth box and predicted box
        # ----------------------------------------------------#
        intersect_mins = torch.max(b1_mins, b2_mins)
        intersect_maxes = torch.min(b1_maxes, b2_maxes)
        intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        b1_area = b1_wh[..., 0] * b1_wh[..., 1]
        b2_area = b2_wh[..., 0] * b2_wh[..., 1]
        union_area = b1_area + b2_area - intersect_area
        iou = intersect_area / union_area

        # ----------------------------------------------------#
        #   Top-left and bottom-right corners of the smallest
        #   box that encloses both boxes
        # ----------------------------------------------------#
        enclose_mins = torch.min(b1_mins, b2_mins)
        enclose_maxes = torch.max(b1_maxes, b2_maxes)
        enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
        # ----------------------------------------------------#
        #   Area of the enclosing box; GIoU subtracts the share
        #   of that area not covered by the union from the IoU
        # ----------------------------------------------------#
        enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
        giou = iou - (enclose_area - union_area) / enclose_area

        return giou

    def forward(self, l, input, targets=None):
        # ----------------------------------------------------#
        #   l is the index of the feature level being processed
        #   input has shape bs, 3*(5+num_classes), 13, 13
        #                   bs, 3*(5+num_classes), 26, 26
        #                   bs, 3*(5+num_classes), 52, 52
        #   targets holds the ground-truth boxes.
        # ----------------------------------------------------#
        # --------------------------------#
        #   Batch size, feature-map height and width
        #   (13 and 13 for the coarsest level)
        # --------------------------------#
        bs = input.size(0)
        in_h = input.size(2)
        in_w = input.size(3)
        # -----------------------------------------------------------------------#
        #   Stride: how many input pixels one feature-map cell spans per side.
        #   13x13 feature map -> one cell spans 32 pixels
        #   26x26 feature map -> one cell spans 16 pixels
        #   52x52 feature map -> one cell spans 8 pixels
        #   So stride_h = stride_w = 32, 16 or 8 depending on the level
        #   (both are 32 for the 13x13 level).
        # -----------------------------------------------------------------------#
        stride_h = self.input_shape[0] / in_h
        stride_w = self.input_shape[1] / in_w
        # -------------------------------------------------#
        #   scaled_anchors are expressed in feature-map cells
        # -------------------------------------------------#
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        # -----------------------------------------------#
        #   There are three inputs in total; each is reshaped as
        #   bs, 3*(5+num_classes), 13, 13 => batch_size, 3, 13, 13, 5 + num_classes
        #   batch_size, 3, 26, 26, 5 + num_classes
        #   batch_size, 3, 52, 52, 5 + num_classes
        # -----------------------------------------------#
        prediction = input.view(bs, len(self.anchors_mask[l]), self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4,
                                                                                                    2).contiguous()

        # -----------------------------------------------#
        #   Offsets of the box centre inside its cell
        # -----------------------------------------------#
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        # -----------------------------------------------#
        #   Width and height adjustments relative to the anchor
        # -----------------------------------------------#
        w = prediction[..., 2]
        h = prediction[..., 3]
        # -----------------------------------------------#
        #   Objectness confidence: is there an object?
        # -----------------------------------------------#
        conf = torch.sigmoid(prediction[..., 4])
        # -----------------------------------------------#
        #   Per-class confidence
        # -----------------------------------------------#
        pred_cls = torch.sigmoid(prediction[..., 5:])

        # -----------------------------------------------#
        #   Build the output the network should have produced
        # -----------------------------------------------#
        y_true, noobj_mask, box_loss_scale = self.get_target(l, targets, scaled_anchors, in_h, in_w)

        # ---------------------------------------------------------------#
        #   Decode the predictions and measure their overlap with the
        #   ground truth. Predictions that overlap a ground-truth box
        #   strongly are ignored: they are already fairly accurate, so
        #   treating them as negative samples would be inappropriate.
        # ----------------------------------------------------------------#
        noobj_mask, pred_boxes = self.get_ignore(l, x, y, h, w, targets, scaled_anchors, in_h, in_w, noobj_mask)

        if self.cuda:
            y_true = y_true.type_as(x)
            noobj_mask = noobj_mask.type_as(x)
            box_loss_scale = box_loss_scale.type_as(x)
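        # --------------------------------------------------------------------------#
        #   box_loss_scale is width * height of the ground-truth box. Both are
        #   normalized to 0-1, so the product is also in 0-1.
        #   2 - product gives large boxes a smaller weight and small boxes a larger one.
        # --------------------------------------------------------------------------#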
        box_loss_scale = 2 - box_loss_scale

        loss = 0
        obj_mask = y_true[..., 4] == 1
        n = torch.sum(obj_mask)
        if n != 0:
            if self.giou:
                # ---------------------------------------------------------------#
                #   GIoU between the predicted and the ground-truth boxes
                # ----------------------------------------------------------------#
                giou = self.box_giou(pred_boxes, y_true[..., :4]).type_as(x)
                loss_loc = torch.mean((1 - giou)[obj_mask])
            else:
                # -----------------------------------------------------------#
                #   Loss on the centre offsets; BCELoss works somewhat better here
                # -----------------------------------------------------------#
                loss_x = torch.mean(self.BCELoss(x[obj_mask], y_true[..., 0][obj_mask]) * box_loss_scale[obj_mask])
                loss_y = torch.mean(self.BCELoss(y[obj_mask], y_true[..., 1][obj_mask]) * box_loss_scale[obj_mask])
                # -----------------------------------------------------------#
                #   Loss on the width/height adjustments
                # -----------------------------------------------------------#
                loss_w = torch.mean(self.MSELoss(w[obj_mask], y_true[..., 2][obj_mask]) * box_loss_scale[obj_mask])
                loss_h = torch.mean(self.MSELoss(h[obj_mask], y_true[..., 3][obj_mask]) * box_loss_scale[obj_mask])
                loss_loc = (loss_x + loss_y + loss_h + loss_w) * 0.1

            loss_cls = torch.mean(self.BCELoss(pred_cls[obj_mask], y_true[..., 5:][obj_mask]))
            loss += loss_loc * self.box_ratio + loss_cls * self.cls_ratio

        loss_conf = torch.mean(self.BCELoss(conf, obj_mask.type_as(conf))[noobj_mask.bool() | obj_mask])
        loss += loss_conf * self.balance[l] * self.obj_ratio
        # if n != 0:
        #     print(loss_loc * self.box_ratio, loss_cls * self.cls_ratio, loss_conf * self.balance[l] * self.obj_ratio)
        return loss

    def calculate_iou(self, _box_a, _box_b):
        # -----------------------------------------------------------#
        #   Top-left and bottom-right corners of the boxes in _box_a
        #   (the ground-truth boxes)
        # -----------------------------------------------------------#
        b1_x1, b1_x2 = _box_a[:, 0] - _box_a[:, 2] / 2, _box_a[:, 0] + _box_a[:, 2] / 2
        b1_y1, b1_y2 = _box_a[:, 1] - _box_a[:, 3] / 2, _box_a[:, 1] + _box_a[:, 3] / 2
        # -----------------------------------------------------------#
        #   Top-left and bottom-right corners of the boxes in _box_b
        #   (anchors, or predicted boxes decoded from the anchors)
        # -----------------------------------------------------------#
        b2_x1, b2_x2 = _box_b[:, 0] - _box_b[:, 2] / 2, _box_b[:, 0] + _box_b[:, 2] / 2
        b2_y1, b2_y2 = _box_b[:, 1] - _box_b[:, 3] / 2, _box_b[:, 1] + _box_b[:, 3] / 2

        # -----------------------------------------------------------#
        #   Store both sets of boxes in (x1, y1, x2, y2) form
        # -----------------------------------------------------------#
        box_a = torch.zeros_like(_box_a)
        box_b = torch.zeros_like(_box_b)
        box_a[:, 0], box_a[:, 1], box_a[:, 2], box_a[:, 3] = b1_x1, b1_y1, b1_x2, b1_y2
        box_b[:, 0], box_b[:, 1], box_b[:, 2], box_b[:, 3] = b2_x1, b2_y1, b2_x2, b2_y2

        # -----------------------------------------------------------#
        #   A is the number of boxes in box_a, B the number in box_b
        # -----------------------------------------------------------#
        A = box_a.size(0)
        B = box_b.size(0)

        # -----------------------------------------------------------#
        #   Intersection area of every (a, b) pair
        # -----------------------------------------------------------#
        max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2), box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
        min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2), box_b[:, :2].unsqueeze(0).expand(A, B, 2))
        inter = torch.clamp((max_xy - min_xy), min=0)
        inter = inter[:, :, 0] * inter[:, :, 1]
        # -----------------------------------------------------------#
        #   Area of each box in box_a and box_b
        # -----------------------------------------------------------#
        area_a = ((box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
        area_b = ((box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
        # -----------------------------------------------------------#
        #   IoU = intersection / union
        # -----------------------------------------------------------#
        union = area_a + area_b - inter
        return inter / union  # [A,B]

    def get_target(self, l, targets, anchors, in_h, in_w):
        # -----------------------------------------------------#
        #   Number of images in the batch
        # -----------------------------------------------------#
        bs = len(targets)
        # -----------------------------------------------------#
        #   Marks which anchors contain no object
        # -----------------------------------------------------#
        # e.g. [1,3,13,13], initialised to 1: every anchor starts as "no object".
        # requires_grad=False means no gradient is computed for this tensor.
        noobj_mask = torch.ones(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
        # -----------------------------------------------------#
        #   Makes the network pay more attention to small objects
        # -----------------------------------------------------#
        # e.g. [1,3,13,13], initialised to 0
        box_loss_scale = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
        # -----------------------------------------------------#
        #   batch_size, 3, 13, 13, 5 + num_classes
        # -----------------------------------------------------#
        y_true = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, self.bbox_attrs, requires_grad=False)

        # Process every image separately
        for b in range(bs):
            if len(targets[b]) == 0:
                continue
            # Ground-truth boxes of this image; start from a zero tensor shaped like targets[b]
            batch_target = torch.zeros_like(targets[b])
            # -------------------------------------------------------#
            #   Centre of each positive sample on the feature map
            # -------------------------------------------------------#
            # The ground-truth boxes are normalized; multiplying by the
            # feature-map width/height maps them onto the feature map
            batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
            batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
            batch_target[:, 4] = targets[b][:, 4]
            batch_target = batch_target.cpu()

            # -------------------------------------------------------#
            #   Ground-truth boxes as (0, 0, w, h): only the shape matters here
            #   num_true_box, 4
            # -------------------------------------------------------#
            gt_box = torch.FloatTensor(torch.cat((torch.zeros((batch_target.size(0), 2)), batch_target[:, 2:4]), 1))
            # -------------------------------------------------------#
            #   Anchors as (0, 0, w, h)
            #   9, 4
            # -------------------------------------------------------#
            anchor_shapes = torch.FloatTensor(
                torch.cat((torch.zeros((len(anchors), 2)), torch.FloatTensor(anchors)), 1))
            # -------------------------------------------------------#
            #   IoU by shape
            #   self.calculate_iou(gt_box, anchor_shapes) = [num_true_box, 9]:
            #   overlap of every ground-truth box with the 9 anchors
            #   best_ns: for every ground-truth box, the index of the
            #   anchor with the largest overlap (argmax returns indices only)
            # -------------------------------------------------------#
            best_ns = torch.argmax(self.calculate_iou(gt_box, anchor_shapes), dim=-1)
            # t is the index of the ground-truth box
            # best_n is the index of the anchor that overlaps that box most
            for t, best_n in enumerate(best_ns):
                if best_n not in self.anchors_mask[l]:
                    continue
                # ----------------------------------------#
                #   Position of this anchor among the anchors of the current level
                # ----------------------------------------#
                k = self.anchors_mask[l].index(best_n)
                # ----------------------------------------#
                #   Grid cell that contains the ground-truth centre
                # ----------------------------------------#
                i = torch.floor(batch_target[t, 0]).long()
                j = torch.floor(batch_target[t, 1]).long()
                # ----------------------------------------#
                #   Class of the ground-truth box
                # ----------------------------------------#
                c = batch_target[t, 4].long()

                # ----------------------------------------#
                #   noobj_mask marks anchors without an object,
                #   so clear it for the matched anchor
                # ----------------------------------------#
                noobj_mask[b, k, j, i] = 0
                if not self.giou:
                    # ----------------------------------------#
                    #   Encoded targets: tx, ty are the centre offsets inside
                    #   the cell, tw, th the log-ratios to the anchor size
                    # ----------------------------------------#
                    y_true[b, k, j, i, 0] = batch_target[t, 0] - i.float()
                    y_true[b, k, j, i, 1] = batch_target[t, 1] - j.float()
                    y_true[b, k, j, i, 2] = math.log(batch_target[t, 2] / anchors[best_n][0])
                    y_true[b, k, j, i, 3] = math.log(batch_target[t, 3] / anchors[best_n][1])
                    y_true[b, k, j, i, 4] = 1
                    y_true[b, k, j, i, c + 5] = 1
                else:
                    # ----------------------------------------#
                    #   GIoU branch: store the ground-truth box itself
                    #   (centre and size in feature-map cells), not offsets
                    # ----------------------------------------#
                    y_true[b, k, j, i, 0] = batch_target[t, 0]
                    y_true[b, k, j, i, 1] = batch_target[t, 1]
                    y_true[b, k, j, i, 2] = batch_target[t, 2]
                    y_true[b, k, j, i, 3] = batch_target[t, 3]
                    y_true[b, k, j, i, 4] = 1
                    y_true[b, k, j, i, c + 5] = 1
                # ----------------------------------------#
                #   Normalized area of the box, used to weight the loss:
                #   large objects get a small weight, small objects a large one
                # ----------------------------------------#
                box_loss_scale[b, k, j, i] = batch_target[t, 2] * batch_target[t, 3] / in_w / in_h
        return y_true, noobj_mask, box_loss_scale

    def get_ignore(self, l, x, y, h, w, targets, scaled_anchors, in_h, in_w, noobj_mask):
        # -----------------------------------------------------#
        #   Number of images in the batch
        # -----------------------------------------------------#
        bs = len(targets)

        # -----------------------------------------------------#
        #   Build the grid: the top-left corner of every cell,
        #   to which the predicted centre offsets are added
        # -----------------------------------------------------#
        grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
            int(bs * len(self.anchors_mask[l])), 1, 1).view(x.shape).type_as(x)
        grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
            int(bs * len(self.anchors_mask[l])), 1, 1).view(y.shape).type_as(x)

        # Anchor widths and heights of this feature level
        scaled_anchors_l = np.array(scaled_anchors)[self.anchors_mask[l]]
        anchor_w = torch.Tensor(scaled_anchors_l).index_select(1, torch.LongTensor([0])).type_as(x)
        anchor_h = torch.Tensor(scaled_anchors_l).index_select(1, torch.LongTensor([1])).type_as(x)

        anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(w.shape)
        anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(h.shape)
        # -------------------------------------------------------#
        #   Centre and size of the adjusted anchors (the predicted boxes)
        # -------------------------------------------------------#
        pred_boxes_x = torch.unsqueeze(x + grid_x, -1)
        pred_boxes_y = torch.unsqueeze(y + grid_y, -1)
        pred_boxes_w = torch.unsqueeze(torch.exp(w) * anchor_w, -1)
        pred_boxes_h = torch.unsqueeze(torch.exp(h) * anchor_h, -1)
        pred_boxes = torch.cat([pred_boxes_x, pred_boxes_y, pred_boxes_w, pred_boxes_h], dim=-1)

        for b in range(bs):
            # -------------------------------------------------------#
            #   Flatten the predicted boxes of this image
            #   pred_boxes_for_ignore      num_anchors, 4
            # -------------------------------------------------------#
            pred_boxes_for_ignore = pred_boxes[b].view(-1, 4)
            # -------------------------------------------------------#
            #   Ground-truth boxes, rescaled to feature-map cells
            #   gt_box      num_true_box, 4
            # -------------------------------------------------------#
            if len(targets[b]) > 0:
                batch_target = torch.zeros_like(targets[b])
                # -------------------------------------------------------#
                #   Centre of each positive sample on the feature map
                # -------------------------------------------------------#
                batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
                batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
                batch_target = batch_target[:, :4].type_as(x)
                # -------------------------------------------------------#
                #   IoU of every ground-truth box with every predicted box
                #   anch_ious       num_true_box, num_anchors
                # -------------------------------------------------------#
                anch_ious = self.calculate_iou(batch_target, pred_boxes_for_ignore)
                # -------------------------------------------------------#
                #   For every predicted box, its largest overlap with any ground-truth box
                #   anch_ious_max   num_anchors
                # -------------------------------------------------------#
                anch_ious_max, _ = torch.max(anch_ious, dim=0)
                anch_ious_max = anch_ious_max.view(pred_boxes[b].size()[:3])
                noobj_mask[b][anch_ious_max > self.ignore_threshold] = 0
        return noobj_mask, pred_boxes


def weights_init(net, init_type='normal', init_gain=0.02):
    def init_func(m):
        classname = m.__class__.__name__
        if hasattr(m, 'weight') and classname.find('Conv') != -1:
            if init_type == 'normal':
                torch.nn.init.normal_(m.weight.data, 0.0, init_gain)
            elif init_type == 'xavier':
                torch.nn.init.xavier_normal_(m.weight.data, gain=init_gain)
            elif init_type == 'kaiming':
                torch.nn.init.kaiming_normal_(m.weight.data, a=0, mode='fan_in')
            elif init_type == 'orthogonal':
                torch.nn.init.orthogonal_(m.weight.data, gain=init_gain)
            else:
                raise NotImplementedError('initialization method [%s] is not implemented' % init_type)
        elif classname.find('BatchNorm2d') != -1:
            torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
            torch.nn.init.constant_(m.bias.data, 0.0)

    print('initialize network with %s type' % init_type)
    net.apply(init_func)


def get_lr_scheduler(lr_decay_type, lr, min_lr, total_iters, warmup_iters_ratio=0.05, warmup_lr_ratio=0.1,
                     no_aug_iter_ratio=0.05, step_num=10):
    def yolox_warm_cos_lr(lr, min_lr, total_iters, warmup_total_iters, warmup_lr_start, no_aug_iter, iters):
        if iters <= warmup_total_iters:
            # lr = (lr - warmup_lr_start) * iters / float(warmup_total_iters) + warmup_lr_start
            lr = (lr - warmup_lr_start) * pow(iters / float(warmup_total_iters), 2) + warmup_lr_start
        elif iters >= total_iters - no_aug_iter:
            lr = min_lr
        else:
            lr = min_lr + 0.5 * (lr - min_lr) * (
                    1.0 + math.cos(
                math.pi * (iters - warmup_total_iters) / (total_iters - warmup_total_iters - no_aug_iter))
            )
        return lr

    def step_lr(lr, decay_rate, step_size, iters):
        if step_size < 1:
            raise ValueError("step_size must be at least 1.")
        n = iters // step_size
        out_lr = lr * decay_rate ** n
        return out_lr

    if lr_decay_type == "cos":
        warmup_total_iters = min(max(warmup_iters_ratio * total_iters, 1), 3)
        warmup_lr_start = max(warmup_lr_ratio * lr, 1e-6)
        no_aug_iter = min(max(no_aug_iter_ratio * total_iters, 1), 15)
        func = partial(yolox_warm_cos_lr, lr, min_lr, total_iters, warmup_total_iters, warmup_lr_start, no_aug_iter)
    else:
        decay_rate = (min_lr / lr) ** (1 / (step_num - 1))
        step_size = total_iters / step_num
        func = partial(step_lr, lr, decay_rate, step_size)

    return func


def set_optimizer_lr(optimizer, lr_scheduler_func, epoch):
    lr = lr_scheduler_func(epoch)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

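3.5、Improvement 1

Positive sample: if the current predicted bounding box overlaps a ground-truth object better than any other bounding box does, its confidence target is 1.
Ignored sample: if the current predicted bounding box is not the best one but its overlap with a ground-truth object exceeds a threshold (0.5 here), the network ignores this prediction.
Negative sample: if a bounding box is not matched to any ground-truth object, its confidence target is 0.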
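
The three cases can be sketched as follows for the simplified situation of a single ground-truth box. The function name `assign_samples` and the label strings are illustrative assumptions, and the real training code works on whole tensors rather than Python lists.

```python
def assign_samples(ious, ignore_thres=0.5):
    """Label every candidate box given its IoU with one ground-truth box."""
    # The box with the largest IoU is the positive sample
    best = max(range(len(ious)), key=lambda k: ious[k])
    labels = []
    for k, iou in enumerate(ious):
        if k == best:
            labels.append("positive")   # confidence target 1
        elif iou > ignore_thres:
            labels.append("ignored")    # good but not the best: no loss
        else:
            labels.append("negative")   # confidence target 0
    return labels


labels = assign_samples([0.9, 0.8, 0.1])
print(labels)
```

With IoUs of 0.9, 0.8 and 0.1, the first box is the positive sample, the second is ignored because it exceeds the 0.5 threshold without being the best, and the third is a negative sample.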

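3.6、Improvement 2

YOLOv3 adopts an FPN-style design and detects objects of different sizes on feature maps of several scales, which improves the detection of small objects. Downsampling the input by 32, 16 and 8 gives 3 feature maps of different scales, and 3 anchors are used on each of them. Each scale therefore yields N * N * (3 * 85) values, where N is the side length of that feature map and 85 = 4 box coordinates + 1 confidence + 80 classes.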
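
As a quick check of these numbers, the sketch below computes the three output shapes for a 416 x 416 input and 80 classes. The function name `yolo_v3_output_shapes` is an assumption made for illustration.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80, anchors_per_scale=3,
                          strides=(32, 16, 8)):
    """Return (N, N, anchors * (5 + num_classes)) for every detection scale."""
    shapes = []
    for stride in strides:
        n = input_size // stride            # side length of the feature map
        shapes.append((n, n, anchors_per_scale * (5 + num_classes)))
    return shapes


shapes = yolo_v3_output_shapes()
# Every cell predicts 3 boxes, so the total number of candidate boxes is:
total_boxes = sum(n * m * 3 for n, m, _ in shapes)
print(shapes, total_boxes)
```

The total of 10647 candidate boxes is the same number that appears in the shape comments of the post-processing code in section 3.3.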

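3.7、Darknet-53 summary

Darknet-53 makes the following main improvements:
(1) It uses no max-pooling layers; downsampling is done by convolutions with stride 2.
(2) Every convolution is followed by a BN layer and a Leaky ReLU, which helps against overfitting.
(3) It borrows the idea of residual networks so that the network can extract deeper features while avoiding vanishing or exploding gradients.
(4) An intermediate feature map is concatenated with the upsampled output of a later layer to fuse features from several scales.

Different layers of a deep network extract different features:
shallow layers learn fine-grained information such as edges, shapes, corners, patches and colours;
deep layers learn specialised semantic information such as textures, eyes, legs and cars.

YOLOv1 output: 7*7*30
YOLOv2 output: 13*13*5*25
YOLOv3:
Darknet-53: 52 convolutional layers and 1 fully connected layer
Input image: 416*416
The image is divided into grid cells; one cell spans 8, 16 or 32 pixels per side, and 13*13 is the number of cells
13*13 -> 32x downsampling (large objects)
26*26 -> 16x downsampling (medium objects)
52*52 -> 8x downsampling (small objects)

Assignment no longer depends only on which grid cell the object falls in, but on which anchor has the largest IoU with the labelled box (9 anchors in total: 3 per grid cell on each of the 3 scales). That anchor is responsible for the prediction and is the positive sample; anchors that are not the maximum are not positive samples.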
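
The anchor choice can be sketched in plain Python. The anchor sizes and the mask are the ones listed in the `YOLOLoss` code of section 3.4; the helper names `shape_iou` and `match_anchor` are illustrative assumptions. Both boxes are aligned at a common corner, so only width and height affect the IoU.

```python
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]
# Anchor indices used by the 13x13, 26x26 and 52x52 feature maps
ANCHORS_MASK = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]


def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same top-left corner."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def match_anchor(gt_wh):
    """Return (anchor index, feature level, slot inside that level)."""
    best = max(range(len(ANCHORS)), key=lambda k: shape_iou(gt_wh, ANCHORS[k]))
    for level, mask in enumerate(ANCHORS_MASK):
        if best in mask:
            return best, level, mask.index(best)


best, level, slot = match_anchor((120, 95))   # a fairly large box, in pixels
print(best, level, slot)
```

A 120 x 95 box matches anchor 6 (116 x 90), which belongs to the 13 x 13 level, whereas a 12 x 14 box matches anchor 0 on the 52 x 52 level. This is how large objects end up on the coarse feature map and small objects on the fine one.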

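Positive samples contribute to every loss term (localisation, confidence, classification).
Negative samples contribute only to the confidence term.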

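3.8、Positive and negative sample matching

What are positive and negative samples?

Positive and negative samples are used to compute the loss during training; the concept does not exist during prediction or validation. A positive sample is not a hand-labelled GT box. Both kinds of sample refer to boxes generated by the algorithm, not to the raw GT data. Positive samples pull the prediction towards the ground truth, and negative samples push the prediction away from everything that is not the ground truth.

Why do samples have to be selected during training?

In object detection, not every predicted box can enter the loss function, mainly because there are far too many boxes. The positive and negative samples are therefore selected first, and the loss is computed afterwards. Positive samples are used for both regression and classification. Negative samples have no regression target, so they are used only for classification (as background).

Why train on negative samples?

Negative samples reduce false detections and misclassifications and improve the generalisation of the model. Put simply, they tell the detector "this is not a target you are looking for". The ratio of positive to negative samples should ideally be about 1:1 to 1:2. The counts must not differ too much, especially when there are few positive samples to begin with. If negative samples greatly outnumber positive ones, they swamp the loss of the positive samples, which slows convergence and lowers detection accuracy. This is the common sample-imbalance problem in object detection, and one remedy is to increase the number of positive samples.

How does YOLOv3 define positive and negative samples?

YOLOv3 assigns positive and negative samples based on the IoU between anchors and GT boxes.
The steps are:

Step 1: every object has exactly one positive sample. The max-IoU matching strategy picks the anchor with the largest IoU (no threshold), and that anchor is the positive sample. (One positive sample per object means that the object first selects one of the three levels (large, medium, small), then one grid cell on that level, then one of the three anchors of that cell.)

Step 2: boxes with IoU < 0.2 (a manually chosen threshold) are negative samples.

Step 3: everything that is neither positive nor negative is an ignored sample.

For example, suppose the largest IoU between a drbox (the predicted box obtained by adjusting an anchor) and the gtbox is 0.9, and boxes with IoU below 0.2 are negative samples.

A box with IoU 0.8 is then an ignored sample, and a box with IoU 0.1 is a negative sample.

Step 4: positive anchors are used to learn classification and regression, positive and negative anchors are used to learn the confidence, and ignored samples are left out.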
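
Step 4 can be illustrated with a small sketch of the confidence term: a binary cross-entropy that is averaged over positive and negative samples only. It follows the same idea as the masked `loss_conf` in the `YOLOLoss` code of section 3.4, but the function names and label strings here are assumptions made for illustration.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p and target y."""
    p = min(max(p, eps), 1.0 - eps)      # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def confidence_loss(conf, labels):
    """Mean BCE over positive (target 1) and negative (target 0) samples."""
    terms = [bce(p, 1.0 if lab == "positive" else 0.0)
             for p, lab in zip(conf, labels)
             if lab != "ignored"]          # ignored samples add nothing
    return sum(terms) / len(terms) if terms else 0.0


loss = confidence_loss([0.9, 0.8, 0.1], ["positive", "ignored", "negative"])
print(round(loss, 4))
```

The ignored box with confidence 0.8 does not appear in the result: the loss is the mean of -log(0.9) for the positive sample and -log(1 - 0.1) for the negative one.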

4、YOLOv4 summary

4.1、Network

睿智的目标检测30——Pytorch搭建YoloV4目标检测平台
【YOLO系列】–YOLOv4超详细解读/总结（网络结构）

The backbone feature extractor is improved in two ways:
a) Backbone: DarkNet53 => CSPDarkNet53
b) Activation function: Mish
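
Mish is defined as x * tanh(softplus(x)) with softplus(x) = log(1 + exp(x)). A scalar sketch in plain Python, with a numerically stable softplus, shows its shape; the PyTorch module used by the backbone appears in the code further below, and the function names here are illustrative.

```python
import math


def softplus(x):
    # log(1 + exp(x)), rewritten so that exp() never overflows
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))


values = [round(mish(v), 4) for v in (-5.0, 0.0, 1.0, 5.0)]
print(values)
```

Unlike ReLU, Mish is smooth and lets small negative values through: it is slightly negative for negative inputs, exactly 0 at 0, and approaches x for large positive inputs.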

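The CSPNet structure is not complicated. It splits the original stack of residual blocks into two parts:
the main branch keeps stacking the residual blocks as before;
the other branch acts like a residual edge and, after a small amount of processing, connects directly to the end.
CSP can therefore be seen as containing one large residual edge.
SPP greatly enlarges the receptive field and separates out the most significant context features.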
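
To make the SPP idea concrete, here is a small plain-Python sketch on a single-channel feature map. It assumes the configuration commonly used with YOLOv4, namely stride-1 max pooling with kernel sizes 5, 9 and 13 whose outputs are concatenated with the input. The kernel sizes and the function names are assumptions, not taken from the text above.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling whose output has the same size as the input.

    The window is clipped at the borders, which is equivalent to padding
    the map with negative infinity.
    """
    h, w, r = len(fmap), len(fmap[0]), k // 2
    return [[max(fmap[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def spp(fmap, kernel_sizes=(5, 9, 13)):
    """Return the input plus one pooled copy per kernel size (stacked as channels)."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]


# A 13x13 map whose value grows with the position
fmap = [[float(y * 13 + x) for x in range(13)] for y in range(13)]
out = spp(fmap)
print(len(out), len(out[0]), len(out[0][0]))
```

The spatial size stays 13 x 13 while the number of channels is multiplied by 4. Each pooled copy summarises a larger neighbourhood, and with a 13 x 13 kernel the centre cell already sees the whole map, which is the sense in which SPP enlarges the receptive field.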

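In the original PANet structure, the most important characteristic is that features are extracted repeatedly. Panel (a) is the conventional feature pyramid structure. After the pass through the feature pyramid in (a) is complete, a second pass in the opposite direction, shown in (b), still has to be carried out.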

import math
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F


#-------------------------------------------------#
#   Mish activation function
#-------------------------------------------------#
class Mish(nn.Module):
    def __init__(self):
        super(Mish, self).__init__()

    def forward(self, x):
        # F.softplus(x) = log(1 + exp(x))
        return x * torch.tanh(F.softplus(x))

#---------------------------------------------------#
#   Convolution block: convolution + normalization + activation (CBM)
#   Conv2d + BatchNormalization + Mish
#---------------------------------------------------#
class BasicConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super(BasicConv, self).__init__()

        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, kernel_size//2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.activation = Mish()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.activation(x)
        return x

#---------------------------------------------------#
#   Building block of the CSPdarknet stage (Res unit):
#   the residual block that is stacked inside a stage
#---------------------------------------------------#
class Resblock(nn.Module):
    def __init__(self, channels, hidden_channels=None):
        super(Resblock, self).__init__()

        if hidden_channels is None:
            hidden_channels = channels

        self.block = nn.Sequential(
            BasicConv(channels, hidden_channels, 1),
            BasicConv(hidden_channels, channels, 3)
        )

    def forward(self, x):
        return x + self.block(x)

#--------------------------------------------------------------------#
#   CSPdarknet的结构块
#   首先利用ZeroPadding2D和一个步长为2x2的卷积块进行高和宽的压缩
#   然后建立一个大的残差边shortconv、这个大残差边绕过了很多的残差结构
#   主干部分会对num_blocks进行循环，循环内部是残差结构。
#   对于整个CSPdarknet的结构块，就是一个大残差块+内部多个小残差块
#--------------------------------------------------------------------#
class Resblock_body(nn.Module):
    def __init__(self, in_channels, out_channels, num_blocks, first):
        super(Resblock_body, self).__init__()
        #----------------------------------------------------------------#
        #   Halve the height and width with a stride-2 conv block
        #----------------------------------------------------------------#
        self.downsample_conv = BasicConv(in_channels, out_channels, 3, stride=2)

        if first:
            #--------------------------------------------------------------------------#
            #   Large residual edge self.split_conv0, which bypasses the stack of residual blocks
            #--------------------------------------------------------------------------#
            self.split_conv0 = BasicConv(out_channels, out_channels, 1)

            #----------------------------------------------------------------#
            #   Main branch: the residual structure (a single Resblock in the first stage)
            #----------------------------------------------------------------#
            self.split_conv1 = BasicConv(out_channels, out_channels, 1)  
            self.blocks_conv = nn.Sequential(
                Resblock(channels=out_channels, hidden_channels=out_channels//2),
                BasicConv(out_channels, out_channels, 1)
            )

            self.concat_conv = BasicConv(out_channels*2, out_channels, 1)
        else:
            #--------------------------------------------------------------------------#
            #   Large residual edge self.split_conv0, which bypasses the stack of residual blocks
            #--------------------------------------------------------------------------#
            self.split_conv0 = BasicConv(out_channels, out_channels//2, 1)

            #----------------------------------------------------------------#
            #   Main branch: num_blocks residual blocks stacked in sequence
            #----------------------------------------------------------------#
            self.split_conv1 = BasicConv(out_channels, out_channels//2, 1)
            self.blocks_conv = nn.Sequential(
                *[Resblock(out_channels//2) for _ in range(num_blocks)],
                BasicConv(out_channels//2, out_channels//2, 1)
            )

            self.concat_conv = BasicConv(out_channels, out_channels, 1)

    def forward(self, x):
        x = self.downsample_conv(x)

        x0 = self.split_conv0(x)

        x1 = self.split_conv1(x)
        x1 = self.blocks_conv(x1)

        #------------------------------------#
        #   将大残差边再堆叠回来
        #------------------------------------#
        x = torch.cat([x1, x0], dim=1)
        #------------------------------------#
        #   最后对通道数进行整合
        #------------------------------------#
        x = self.concat_conv(x)

        return x

#---------------------------------------------------#
#   Main body of CSPDarknet53
#   Input: a 416x416x3 image
#   Output: three effective feature layers
#---------------------------------------------------#
class CSPDarkNet(nn.Module):
    def __init__(self, layers):
        super(CSPDarkNet, self).__init__()
        self.inplanes = 32
        # 416,416,3 -> 416,416,32
        self.conv1 = BasicConv(3, self.inplanes, kernel_size=3, stride=1)
        self.feature_channels = [64, 128, 256, 512, 1024]

        self.stages = nn.ModuleList([
            # 416,416,32 -> 208,208,64
            Resblock_body(self.inplanes, self.feature_channels[0], layers[0], first=True),
            # 208,208,64 -> 104,104,128
            Resblock_body(self.feature_channels[0], self.feature_channels[1], layers[1], first=False),
            # 104,104,128 -> 52,52,256
            Resblock_body(self.feature_channels[1], self.feature_channels[2], layers[2], first=False),
            # 52,52,256 -> 26,26,512
            Resblock_body(self.feature_channels[2], self.feature_channels[3], layers[3], first=False),
            # 26,26,512 -> 13,13,1024
            Resblock_body(self.feature_channels[3], self.feature_channels[4], layers[4], first=False)
        ])

        self.num_features = 1
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()


    def forward(self, x):
        x = self.conv1(x)

        x = self.stages[0](x)
        x = self.stages[1](x)
        out3 = self.stages[2](x)
        out4 = self.stages[3](out3)
        out5 = self.stages[4](out4)

        return out3, out4, out5
    
def darknet53(pretrained):
    model = CSPDarkNet([1, 2, 8, 8, 4])
    if pretrained:
        model.load_state_dict(torch.load("model_data/CSPdarknet53_backbone_weights.pth"))
    return model

PANet structure

from collections import OrderedDict

import torch
import torch.nn as nn

from nets.CSPdarknet import darknet53


def conv2d(filter_in, filter_out, kernel_size, stride=1):
    pad = (kernel_size - 1) // 2 if kernel_size else 0
    return nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(filter_in, filter_out, kernel_size=kernel_size, stride=stride, padding=pad, bias=False)),
        ("bn", nn.BatchNorm2d(filter_out)),
        ("relu", nn.LeakyReLU(0.1)),
    ]))

#---------------------------------------------------#
#   SPP structure: max-pool with several kernel sizes,
#   then concatenate the pooled results
#---------------------------------------------------#
class SpatialPyramidPooling(nn.Module):
    def __init__(self, pool_sizes=[5, 9, 13]):
        super(SpatialPyramidPooling, self).__init__()

        self.maxpools = nn.ModuleList([nn.MaxPool2d(pool_size, 1, pool_size//2) for pool_size in pool_sizes])

    def forward(self, x):
        features = [maxpool(x) for maxpool in self.maxpools[::-1]]
        features = torch.cat(features + [x], dim=1)

        return features

#---------------------------------------------------#
#   Convolution + upsampling
#---------------------------------------------------#
class Upsample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Upsample, self).__init__()

        self.upsample = nn.Sequential(
            conv2d(in_channels, out_channels, 1),
            nn.Upsample(scale_factor=2, mode='nearest')
        )

    def forward(self, x):
        x = self.upsample(x)
        return x

#---------------------------------------------------#
#   Block of three convolutions
#---------------------------------------------------#
def make_three_conv(filters_list, in_filters):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
    )
    return m

#---------------------------------------------------#
#   Block of five convolutions
#---------------------------------------------------#
def make_five_conv(filters_list, in_filters):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
    )
    return m

#---------------------------------------------------#
#   YOLOv4 head: produces the final output
#---------------------------------------------------#
def yolo_head(filters_list, in_filters):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 3),
        nn.Conv2d(filters_list[0], filters_list[1], 1),
    )
    return m

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, pretrained = False):
        super(YoloBody, self).__init__()
        #---------------------------------------------------#   
        #   Build the CSPDarknet53 backbone
        #   It yields three effective feature layers with shapes:
        #   52,52,256
        #   26,26,512
        #   13,13,1024
        #---------------------------------------------------#
        self.backbone   = darknet53(pretrained)

        self.conv1      = make_three_conv([512,1024],1024)
        self.SPP        = SpatialPyramidPooling()
        self.conv2      = make_three_conv([512,1024],2048)

        self.upsample1          = Upsample(512,256)
        self.conv_for_P4        = conv2d(512,256,1)
        self.make_five_conv1    = make_five_conv([256, 512],512)

        self.upsample2          = Upsample(256,128)
        self.conv_for_P3        = conv2d(256,128,1)
        self.make_five_conv2    = make_five_conv([128, 256],256)

        # 3*(5+num_classes) = 3*(5+20) = 3*(4+1+20)=75
        self.yolo_head3         = yolo_head([256, len(anchors_mask[0]) * (5 + num_classes)],128)

        self.down_sample1       = conv2d(128,256,3,stride=2)
        self.make_five_conv3    = make_five_conv([256, 512],512)

        # 3*(5+num_classes) = 3*(5+20) = 3*(4+1+20)=75
        self.yolo_head2         = yolo_head([512, len(anchors_mask[1]) * (5 + num_classes)],256)

        self.down_sample2       = conv2d(256,512,3,stride=2)
        self.make_five_conv4    = make_five_conv([512, 1024],1024)

        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        self.yolo_head1         = yolo_head([1024, len(anchors_mask[2]) * (5 + num_classes)],512)


    def forward(self, x):
        #  backbone
        x2, x1, x0 = self.backbone(x)

        # 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512 -> 13,13,2048 
        P5 = self.conv1(x0)
        P5 = self.SPP(P5)
        # 13,13,2048 -> 13,13,512 -> 13,13,1024 -> 13,13,512
        P5 = self.conv2(P5)

        # 13,13,512 -> 13,13,256 -> 26,26,256
        P5_upsample = self.upsample1(P5)
        # 26,26,512 -> 26,26,256
        P4 = self.conv_for_P4(x1)
        # 26,26,256 + 26,26,256 -> 26,26,512
        P4 = torch.cat([P4,P5_upsample],axis=1)
        # 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256
        P4 = self.make_five_conv1(P4)

        # 26,26,256 -> 26,26,128 -> 52,52,128
        P4_upsample = self.upsample2(P4)
        # 52,52,256 -> 52,52,128
        P3 = self.conv_for_P3(x2)
        # 52,52,128 + 52,52,128 -> 52,52,256
        P3 = torch.cat([P3,P4_upsample],axis=1)
        # 52,52,256 -> 52,52,128 -> 52,52,256 -> 52,52,128 -> 52,52,256 -> 52,52,128
        P3 = self.make_five_conv2(P3)

        # 52,52,128 -> 26,26,256
        P3_downsample = self.down_sample1(P3)
        # 26,26,256 + 26,26,256 -> 26,26,512
        P4 = torch.cat([P3_downsample,P4],axis=1)
        # 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256
        P4 = self.make_five_conv3(P4)

        # 26,26,256 -> 13,13,512
        P4_downsample = self.down_sample2(P4)
        # 13,13,512 + 13,13,512 -> 13,13,1024
        P5 = torch.cat([P4_downsample,P5],axis=1)
        # 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512
        P5 = self.make_five_conv4(P5)

        #---------------------------------------------------#
        #   Third feature layer
        #   y3=(batch_size,75,52,52)
        #---------------------------------------------------#
        out2 = self.yolo_head3(P3)
        #---------------------------------------------------#
        #   Second feature layer
        #   y2=(batch_size,75,26,26)
        #---------------------------------------------------#
        out1 = self.yolo_head2(P4)
        #---------------------------------------------------#
        #   First feature layer
        #   y1=(batch_size,75,13,13)
        #---------------------------------------------------#
        out0 = self.yolo_head1(P5)

        return out0, out1, out2

4.2. Computing the prior (anchor) boxes


# -------------------------------------------------------------------------------------------------------#
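#   K-means clusters the boxes in the dataset, but in many datasets the boxes are similar in size,
#   so the 9 clustered anchors differ little from each other, which actually hurts training:
#   different feature layers suit anchors of different sizes, and the smaller a feature layer's
#   shape, the larger the anchors that suit it.
#   The anchors of the original network are already spread over large, medium, and small sizes,
#   so skipping the clustering can still give very good results.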
# -------------------------------------------------------------------------------------------------------#
import glob
import xml.etree.ElementTree as ET

import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm


def cas_iou(box, cluster):
    x = np.minimum(cluster[:, 0], box[0])
    y = np.minimum(cluster[:, 1], box[1])

    intersection = x * y
    area1 = box[0] * box[1]

    area2 = cluster[:, 0] * cluster[:, 1]
    iou = intersection / (area1 + area2 - intersection)

    return iou


def avg_iou(box, cluster):
    return np.mean([np.max(cas_iou(box[i], cluster)) for i in range(box.shape[0])])


###########################
#   box = data
#   k = 9
###########################

# K-means clustering: randomly pick k cluster centers, assign every box to its nearest center,
# update each center from the boxes assigned to it, and repeat until the assignment stops changing.
# In this script the distance is 1 - IoU and each center is updated with the median of its boxes.
def kmeans(box, k):
    # -------------------------------------------------------------#
    #   Total number of boxes, e.g. row = 1399
    # -------------------------------------------------------------#
    row = box.shape[0]

    # -------------------------------------------------------------#
    #   Distance from every box to every cluster center:
    #   a 2-D array with row rows and k columns, e.g. (1399, 9)
    # -------------------------------------------------------------#
    distance = np.empty((row, k))

    # -------------------------------------------------------------#
    #   Cluster index of every box in the previous iteration, shape (row,)
    # -------------------------------------------------------------#
    last_clu = np.zeros((row,))

    # -------------------------------------------------------------#
    #   Randomly pick k boxes as the initial cluster centers;
    #   replace=False means no box is picked twice.
    #   The random state is not reseeded here, so a seed set by the
    #   caller (np.random.seed(0) in __main__) stays in effect.
    # -------------------------------------------------------------#
    cluster = box[np.random.choice(row, k, replace=False)]

    iteration = 0
    while True:
        # -------------------------------------------------------------#
        #   For every box, compute 1 - IoU against each of the k
        #   current cluster centers (boxes are compared by width/height only)
        # -------------------------------------------------------------#
        for i in range(row):
            distance[i] = 1 - cas_iou(box[i], cluster)

        # -------------------------------------------------------------#
        #   For every row, take the index of the smallest distance,
        #   i.e. the nearest cluster center
        # -------------------------------------------------------------#
        near = np.argmin(distance, axis=1)

        # Stop once the assignment no longer changes
        if (last_clu == near).all():
            break

        # -------------------------------------------------------------#
        #   Update each of the k centers with the median of its boxes
        # -------------------------------------------------------------#
        for j in range(k):
            # Skip empty clusters: the median of an empty selection would be NaN
            if np.any(near == j):
                cluster[j] = np.median(box[near == j], axis=0)

        last_clu = near
        if iteration % 5 == 0:
            print('iter: {:d}. avg_iou:{:.2f}'.format(iteration, avg_iou(box, cluster)))
        iteration += 1

    return cluster, near


def load_data(path):
    data = []
    # -------------------------------------------------------------#
    #   Look for boxes in every xml file
    # -------------------------------------------------------------#
    for xml_file in tqdm(glob.glob('{}/*xml'.format(path))):
        tree = ET.parse(xml_file)
        height = int(tree.findtext('./size/height'))
        width = int(tree.findtext('./size/width'))
        if height <= 0 or width <= 0:
            continue

        # -------------------------------------------------------------#
        #   Get the normalized corners of every object
        # -------------------------------------------------------------#
        for obj in tree.iter('object'):
            xmin = int(float(obj.findtext('bndbox/xmin'))) / width
            ymin = int(float(obj.findtext('bndbox/ymin'))) / height
            xmax = int(float(obj.findtext('bndbox/xmax'))) / width
            ymax = int(float(obj.findtext('bndbox/ymax'))) / height

            xmin = np.float64(xmin)
            ymin = np.float64(ymin)
            xmax = np.float64(xmax)
            ymax = np.float64(ymax)
            # Width and height, as fractions of the image size
            data.append([xmax - xmin, ymax - ymin])
    return np.array(data)


if __name__ == '__main__':
    np.random.seed(0)
    # -------------------------------------------------------------#
    #   Running this script processes the xml files in './VOCdevkit/VOC2007/Annotations'
    #   and writes yolo_anchors.txt
    # -------------------------------------------------------------#
    input_shape = [416, 416]
    anchors_num = 9
    # -------------------------------------------------------------#
    #   Load the dataset; VOC-style xml annotations can be used
    # -------------------------------------------------------------#
    path = 'VOCdevkit/VOC2007/Annotations'

    # -------------------------------------------------------------#
    #   Load all xml files
    #   Boxes are stored as width,height converted to fractions of the image size
    # -------------------------------------------------------------#
    print('Load xmls.')
    data = load_data(path)
    print('Load xmls done.')

    # -------------------------------------------------------------#
    #   Run k-means clustering
    # -------------------------------------------------------------#
    print('K-means boxes.')
    cluster, near = kmeans(data, anchors_num)
    print('K-means boxes done.')
    data = data * np.array([input_shape[1], input_shape[0]])
    cluster = cluster * np.array([input_shape[1], input_shape[0]])

    # -------------------------------------------------------------#
    #   Plot the clusters
    # -------------------------------------------------------------#
    for j in range(anchors_num):
        plt.scatter(data[near == j][:, 0], data[near == j][:, 1])
        plt.scatter(cluster[j][0], cluster[j][1], marker='x', c='black')
    plt.savefig("kmeans_for_anchors.jpg")
    plt.show()
    print('Save kmeans_for_anchors.jpg in root dir.')

    cluster = cluster[np.argsort(cluster[:, 0] * cluster[:, 1])]
    print('avg_ratio:{:.2f}'.format(avg_iou(data, cluster)))
    print(cluster)

    # Write the anchors on one line as "w,h, w,h, ..."; the file is closed automatically
    with open("yolo_anchors.txt", 'w') as f:
        f.write(", ".join("%d,%d" % (w, h) for w, h in cluster))
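
As a quick sanity check of the distance used by the script above, here is a minimal, self-contained sketch of the width-height IoU on made-up boxes. `wh_iou` is a hypothetical name that mirrors `cas_iou`, and the box and cluster values are illustrative assumptions, not numbers from VOC.

```python
import numpy as np


def wh_iou(box, clusters):
    """IoU between one (w, h) box and k (w, h) clusters, all aligned at the same corner."""
    inter = np.minimum(clusters[:, 0], box[0]) * np.minimum(clusters[:, 1], box[1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union


# One box and three candidate cluster centers (normalized width, height)
box = np.array([0.2, 0.4])
clusters = np.array([[0.2, 0.4], [0.1, 0.2], [0.4, 0.4]])

ious = wh_iou(box, clusters)
distances = 1 - ious                  # the k-means distance used by the script
nearest = int(np.argmin(distances))   # index of the closest cluster center
print(ious, nearest)
```

Because only widths and heights are compared, two boxes with the same shape have distance 0 no matter where they sit in the image, which is what anchor clustering needs.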

4.3. CIoU computation

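IoU is a ratio, so it is insensitive to the scale of the target object. However, the regression losses commonly used for bounding boxes are not fully equivalent to optimizing IoU, and plain IoU cannot directly optimize boxes that do not overlap.
This led to proposals to use IoU-based measures directly as the regression loss, and CIoU is one of the strongest of these ideas.
CIoU takes the distance between the target and the anchor, the overlap ratio, the scale, and a penalty term into account, which makes box regression more stable and helps avoid problems such as divergence during training that can occur with IoU and GIoU. The penalty term accounts for how well the aspect ratio of the predicted box fits that of the ground-truth box.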


    def box_iou(self, b1, b2):
        """
        Inputs:
        ----------
        # b1: predicted boxes
        # b2: ground-truth boxes
        b1: tensor, shape=(batch, anchor_num, feat_w, feat_h, 4), xywh
        b2: tensor, shape=(batch, anchor_num, feat_w, feat_h, 4), xywh

        Returns:
        -------
        out: tensor, shape=(batch, anchor_num, feat_w, feat_h)
        """
        #----------------------------------------------------#
        #   Top-left and bottom-right corners of the predicted boxes
        #----------------------------------------------------#
        b1_xy       = b1[..., :2]
        b1_wh       = b1[..., 2:4]
        b1_wh_half  = b1_wh / 2.
        b1_mins     = b1_xy - b1_wh_half
        b1_maxes    = b1_xy + b1_wh_half
        #----------------------------------------------------#
        #   Top-left and bottom-right corners of the ground-truth boxes
        #----------------------------------------------------#
        b2_xy       = b2[..., :2]
        b2_wh       = b2[..., 2:4]
        b2_wh_half  = b2_wh / 2.
        b2_mins     = b2_xy - b2_wh_half
        b2_maxes    = b2_xy + b2_wh_half

        #----------------------------------------------------#
        #   IoU between the ground-truth and predicted boxes
        #----------------------------------------------------#
        intersect_mins  = torch.max(b1_mins, b2_mins)
        intersect_maxes = torch.min(b1_maxes, b2_maxes)
        intersect_wh    = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
        intersect_area  = intersect_wh[..., 0] * intersect_wh[..., 1]
        b1_area         = b1_wh[..., 0] * b1_wh[..., 1]
        b2_area         = b2_wh[..., 0] * b2_wh[..., 1]
        union_area      = b1_area + b2_area - intersect_area
        iou             = intersect_area / torch.clamp(union_area,min = 1e-6)

        #----------------------------------------------------#
        #   Offset between the two box centers
        #   (used below for the ratio: center distance / enclosing-box diagonal)
        #----------------------------------------------------#
        center_wh       = b1_xy - b2_xy
        
        #----------------------------------------------------#
        #   Top-left and bottom-right corners of the smallest box enclosing both boxes
        #----------------------------------------------------#
        enclose_mins    = torch.min(b1_mins, b2_mins)
        enclose_maxes   = torch.max(b1_maxes, b2_maxes)
        enclose_wh      = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))

        if self.iou_type == 'ciou':
            #----------------------------------------------------#
            #   Squared distance between the box centers
            #----------------------------------------------------#
            center_distance     = torch.sum(torch.pow(center_wh, 2), axis=-1)
            #----------------------------------------------------#
            #   Squared diagonal length of the enclosing box
            #----------------------------------------------------#
            enclose_diagonal    = torch.sum(torch.pow(enclose_wh, 2), axis=-1)
            ciou                = iou - 1.0 * (center_distance) / torch.clamp(enclose_diagonal, min = 1e-6)

            # Aspect-ratio term: squared difference between atan(w / h) of the predicted and ground-truth boxes
            v       = (4 / (math.pi ** 2)) * torch.pow((torch.atan(b1_wh[..., 0] / torch.clamp(b1_wh[..., 1],min = 1e-6)) - torch.atan(b2_wh[..., 0] / torch.clamp(b2_wh[..., 1], min = 1e-6))), 2)
            alpha   = v / torch.clamp((1.0 - iou + v), min = 1e-6)
            out     = ciou - alpha * v
            
        elif self.iou_type == 'siou':
            #----------------------------------------------------#
            #   Angle cost
            #----------------------------------------------------#
            #----------------------------------------------------#
            #   Distance between the box centers
            #----------------------------------------------------#
            sigma       = torch.pow(torch.sum(torch.pow(center_wh, 2), axis=-1), 0.5)
            
            #----------------------------------------------------#
            #   Sine values: center offset along w and along h, each divided by the center distance
            #----------------------------------------------------#
            sin_alpha_1 = torch.clamp(torch.abs(center_wh[..., 0]) / torch.clamp(sigma, min = 1e-6), min = 0, max = 1)
            sin_alpha_2 = torch.clamp(torch.abs(center_wh[..., 1]) / torch.clamp(sigma, min = 1e-6), min = 0, max = 1)
            
            #----------------------------------------------------#
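            #   Threshold: sqrt(2) / 2, about 0.707
            #   If sin_alpha_1 exceeds 0.707, the angle in that direction is larger than 45°,
            #   so the angle in the other direction is used instead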
            #----------------------------------------------------#
            threshold   = pow(2, 0.5) / 2
            sin_alpha   = torch.where(sin_alpha_1 > threshold, sin_alpha_2, sin_alpha_1)

            #----------------------------------------------------#
            #   The closer alpha is to 45°, the closer angle_cost is to 1 and gamma is to 1
            #   The closer alpha is to 0°, the closer angle_cost is to 0 and gamma is to 2
            #----------------------------------------------------#
            angle_cost  = torch.cos(torch.asin(sin_alpha) * 2 - math.pi / 2)
            gamma       = 2 - angle_cost

            #----------------------------------------------------#
            #   Distance cost
            #   Ratio of the center offset to the width and height of the enclosing rectangle
            #----------------------------------------------------#
            rho_x           = (center_wh[..., 0] / torch.clamp(enclose_wh[..., 0], min = 1e-6)) ** 2
            rho_y           = (center_wh[..., 1] / torch.clamp(enclose_wh[..., 1], min = 1e-6)) ** 2
            distance_cost   = 2 - torch.exp(-gamma * rho_x) - torch.exp(-gamma * rho_y)
            
            #----------------------------------------------------#
            #   Shape cost
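            #   Width/height difference between the ground-truth and predicted boxes, divided by the larger of the two
            #   The smaller the difference, the smaller shape_cost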
            #----------------------------------------------------#
            omiga_w     = torch.abs(b1_wh[..., 0] - b2_wh[..., 0]) / torch.clamp(torch.max(b1_wh[..., 0], b2_wh[..., 0]), min = 1e-6)
            omiga_h     = torch.abs(b1_wh[..., 1] - b2_wh[..., 1]) / torch.clamp(torch.max(b1_wh[..., 1], b2_wh[..., 1]), min = 1e-6)
            shape_cost  = torch.pow(1 - torch.exp(-1 * omiga_w), 4) + torch.pow(1 - torch.exp(-1 * omiga_h), 4)
            out         = iou - 0.5 * (distance_cost + shape_cost)
        return out

YOLOv5 summary


Matching positive and negative samples

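1. Compute the ratios between the width and height of the ground-truth box and those of the anchor: rw = w_gt / w_anchor, rh = h_gt / h_anchor.
2. Take max(rw, 1/rw) and max(rh, 1/rh). These measure the mismatch along the width and height directions: the value is 1 when the two sizes are equal, stays close to 1 when they differ only slightly, and is greater than 1 whenever they differ.
3. Take the larger of these two values as rmax.
4. If rmax < anchor_t (anchor_t = 4), the box and the anchor are matched.
Why anchor_t = 4? Because the scale factors that the network predicts relative to the anchor template's pw and ph are restricted to the range 0 to 4.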
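
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the YOLOv5 source: `match_anchors` is a hypothetical helper, and the anchor and box sizes are made-up values in pixels.

```python
import numpy as np


def match_anchors(gt_wh, anchors_wh, anchor_t=4.0):
    """Boolean mask of shape (num_gt, num_anchors): True where box and anchor match."""
    gt_wh = np.asarray(gt_wh, dtype=float)
    anchors_wh = np.asarray(anchors_wh, dtype=float)
    r = gt_wh[:, None, :] / anchors_wh[None, :, :]  # step 1: rw and rh
    worst = np.maximum(r, 1.0 / r)                  # step 2: max(r, 1/r) per direction
    r_max = worst.max(axis=2)                       # step 3: larger of the two directions
    return r_max < anchor_t                         # step 4: compare with anchor_t


# Illustrative anchors and ground-truth boxes as (width, height)
anchors = [[10, 13], [30, 61], [116, 90]]
gts = [[20, 40], [200, 30]]
mask = match_anchors(gts, anchors)
print(mask)
```

A box can match several anchors at once (the first box here matches two), which is one of the ways this rule produces more positive samples than a best-IoU assignment.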

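Besides the grid cell that contains the center of the ground-truth box, the assignment is also extended to neighboring cells, such as the cell above and the cell to the left. The purpose is to increase the number of positive samples.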
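
A simplified sketch of this expansion follows. It assumes the rule that the horizontal and vertical neighbors closest to the box center are added, which is how I understand the YOLOv5 reference implementation; ties at exactly 0.5 are handled differently there. `positive_cells` is a hypothetical helper and coordinates are in grid units.

```python
def positive_cells(gx, gy, grid_w, grid_h):
    """Grid cells that treat a box centered at (gx, gy) as a positive sample."""
    cx, cy = int(gx), int(gy)      # cell that contains the center
    cells = [(cx, cy)]
    fx, fy = gx - cx, gy - cy      # position of the center inside its cell, in [0, 1)

    # Nearest horizontal neighbor: left if the center is in the left half, else right
    nx = cx - 1 if fx < 0.5 else cx + 1
    # Nearest vertical neighbor: above if the center is in the upper half, else below
    ny = cy - 1 if fy < 0.5 else cy + 1

    # Neighbors outside the grid are dropped
    if 0 <= nx < grid_w:
        cells.append((nx, cy))
    if 0 <= ny < grid_h:
        cells.append((cx, ny))
    return cells


print(positive_cells(3.2, 5.3, 13, 13))  # center in the upper-left quarter of cell (3, 5)
```

With a center in the upper-left quarter of its cell, the extra cells are the one to the left and the one above, matching the description in the text.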

3. Pooling layers

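The two most common pooling operations, max pooling and average pooling, both follow the output-size formula below.
W: input size. F: pooling window (kernel) size. P: padding. S: stride.
Output size: (W - F + 2P) / S + 1, rounded down. Without padding this reduces to (W - F) / S + 1.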
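
A small helper makes the formula concrete. It includes the padding term P, so with P = 0 it reduces to (W - F) / S + 1, and it rounds down, which is the default behavior of common frameworks. `pool_output_size` is a hypothetical name used only for this sketch.

```python
def pool_output_size(w, f, s, p=0):
    """Output size of a pooling (or convolution) window along one dimension."""
    return (w - f + 2 * p) // s + 1


# 2x2 pooling with stride 2 halves a 416-pixel side
print(pool_output_size(416, 2, 2))

# The SPP block in this article pools with kernels 5, 9 and 13, stride 1 and
# padding kernel // 2, so a 13x13 feature map keeps its size
for kernel in (5, 9, 13):
    print(kernel, pool_output_size(13, kernel, 1, kernel // 2))
```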

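Its main functions are:
1. Reducing the amount of computation in the model
2. Reducing the convolutional layer's sensitivity to position

Convolutional layers are very sensitive to position. Take vertical edge detection as an example:
with a 1*2 kernel [1, -1], the output is very sensitive to pixel offsets, and a tiny shift can turn the response at a given location into 0.
(Changes that are this abrupt make the output overly sensitive to exact position.)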


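We want a certain degree of translation invariance, which is what a pooling layer provides.
2D max pooling:
A window slides over the input, and at each position the maximum value inside the window is written to the output.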


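Can a max pooling layer reduce the sensitivity to position?
1. Yes. Max pooling tolerates small shifts of the input; applied to a convolution's output, it has a slight blurring effect.
2. Like convolution, pooling has padding and stride. With multiple input channels, pooling is applied to each input channel separately to produce the corresponding output channel; unlike convolution, it does not fuse the input channels. (Why not? Channel fusion can be left to the convolution.)
3. Number of output channels = number of input channels.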
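
The effect can be checked on a toy image. The sketch below applies the 1x2 kernel [1, -1] from the edge-detection example to an image and to a copy shifted by one pixel, then max-pools both responses. `max_pool2d` and `edge_response` are hypothetical helpers written for this illustration, and the 4x6 image is made up.

```python
import numpy as np


def max_pool2d(x, size=2, stride=1):
    """Plain 2D max pooling on a single-channel array, no padding."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


def edge_response(img):
    """Cross-correlate every row with the 1x2 kernel [1, -1]."""
    return img[:, :-1] - img[:, 1:]


img = np.zeros((4, 6))
img[:, :3] = 1.0          # vertical edge between columns 2 and 3
shifted = np.zeros((4, 6))
shifted[:, :2] = 1.0      # the same edge moved one pixel to the left

a, b = edge_response(img), edge_response(shifted)
pa, pb = max_pool2d(a), max_pool2d(b)
print(a[0], b[0])    # the raw responses fire at different columns
print(pa[0], pb[0])  # after pooling, both fire at pooled column 1
```

The raw edge responses of the two images do not overlap at all, while the pooled responses share an active location, which is the "tolerates small shifts" behavior described above.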

Average pooling:
Outputs the average intensity inside each pooling window.

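Summary: pooling relieves the convolutional layer's sensitivity to position and is usually placed after a convolutional layer. It can reduce computation, but whether it does depends on the stride and padding.
Why are pooling layers used less and less?
Pooling has two roles: 1. making the convolution less sensitive to position; 2. using stride=2 to shrink the output and reduce computation. Today the downsampling is usually done by a convolutional layer with a stride, and because the data is heavily augmented, the convolution does not overfit to one specific position, so the pooling layer matters less.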

4. Residual neural networks

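Model bias: adding more layers makes the model more complex, but the model may in fact drift in the wrong direction, and the result can be worse than that of a smaller model.

If every more complex model contains the previous, smaller model, then the set of functions we can learn only grows as complexity is added.
(Whether or not g(x) learns anything new, the original x is still passed through, which guarantees this growth.) However deep the network is, the skip connections let the shallower part train well first, and the deeper layers are then trained gradually.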
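
The "x is still there" argument can be seen in a few lines. The sketch below is a toy residual block, y = x + g(x), where g is a hypothetical two-layer map with a ReLU; it is not the `Resblock` used in the network code of this article.

```python
import numpy as np


def residual_block(x, w1, w2):
    """y = x + g(x), where g is a small two-layer map with a ReLU in between."""
    g = np.maximum(x @ w1, 0.0) @ w2
    return x + g


x = np.array([[1.0, -2.0, 3.0]])

# If g has learned nothing (all weights zero), the block is exactly the identity,
# so stacking it on a smaller model cannot change what that model computes.
zeros = np.zeros((3, 3))
y_identity = residual_block(x, zeros, zeros)

# With non-zero weights the block adds a correction on top of x
rng = np.random.default_rng(0)
y_learned = residual_block(x, rng.normal(size=(3, 3)), rng.normal(size=(3, 3)))
print(y_identity, y_learned)
```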


IOU

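Formula: IoU = |A ∩ B| / |A ∪ B|, the ratio of the intersection to the union
Advantages:
Scale invariance
Non-negativity
Symmetry
Disadvantages:
1. If the intersection is empty, IoU is 0 no matter how far apart the boxes are, so no gradient can be computed.
2. Boxes that overlap in different ways can share the same IoU, so the value does not reflect how the predicted box actually sits relative to the ground-truth box.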
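
A minimal scalar implementation shows both the scale invariance and the first disadvantage. Boxes are assumed to be given as (x1, y1, x2, y2) corners, and the example boxes are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Scale invariance: scaling both boxes by 10 leaves the IoU unchanged
small = iou((0, 0, 2, 2), (1, 1, 3, 3))
large = iou((0, 0, 20, 20), (10, 10, 30, 30))

# Disjoint boxes: IoU is 0 whether the gap is small or huge
near = iou((0, 0, 1, 1), (2, 0, 3, 1))
far = iou((0, 0, 1, 1), (9, 0, 10, 1))
print(small, large, near, far)
```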

GIOU

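Formula: GIoU = IoU - |C - (A ∪ B)| / |C|, where |C - (A ∪ B)| is the area of the smallest enclosing rectangle that is not covered by the union of the two boxes

GIoU introduces the smallest enclosing rectangle of the predicted and ground-truth boxes and measures how much of that enclosing region the two boxes occupy, which solves the zero-gradient problem when the two boxes do not intersect.

C is the smallest enclosing rectangle of the two boxes and |C| is its area. IoU takes values in [0, 1], whereas GIoU takes values in [-1, 1].
When the two boxes overlap completely, IoU = GIoU = 1. When they do not intersect, IoU = 0 and GIoU is at most 0, approaching -1 as the boxes move farther apart.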

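Advantages:
Unlike IoU, which only looks at the overlapping region, GIoU also takes the non-overlapping region into account, so it reflects the degree of overlap better.
GIoU is a lower bound of IoU, with values in [-1, 1]. It reaches its maximum of 1 when the two boxes coincide and tends to its minimum of -1 when they do not intersect and are infinitely far apart. Compared with IoU, GIoU is therefore a better distance measure: training can continue even when the boxes do not overlap, that is, when IoU = 0.

Disadvantages:
Some overlapping cases still cannot be told apart: when one box completely contains the other, the enclosing rectangle is just the larger box, so GIoU degenerates to IoU and cannot tell where the inner box lies.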
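
The behavior described above can be checked with a small scalar sketch. It uses the standard GIoU definition, IoU minus the share of the enclosing rectangle not covered by the union, with boxes assumed to be (x1, y1, x2, y2) corners; the example boxes are made up.

```python
def giou(a, b):
    """GIoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest rectangle that encloses both boxes
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area


# Disjoint boxes: unlike IoU, the value keeps decreasing as the gap grows
near = giou((0, 0, 1, 1), (2, 0, 3, 1))
far = giou((0, 0, 1, 1), (9, 0, 10, 1))

# Containment: the inner box in a corner or in the middle gives the same GIoU
corner = giou((0, 0, 4, 4), (0, 0, 2, 2))
center = giou((0, 0, 4, 4), (1, 1, 3, 3))
print(near, far, corner, center)
```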

How DIoU works

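Formula:
DIoU = IoU - ρ²(b, b_gt) / c²

b and b_gt are the center points of the predicted box and the ground-truth box, ρ²(b, b_gt) is the squared Euclidean distance between the two centers, and c is the diagonal length of the smallest rectangle enclosing both boxes.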

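Advantages:
DIoU directly minimizes the distance between the two boxes, so the loss converges faster when it is used as a loss function.
Like the GIoU loss, the DIoU loss still gives the bounding box a direction to move in when it does not overlap the target box.
When the two boxes are lined up exactly vertically or horizontally, the enclosing rectangle contains no empty area, so GIoU almost degenerates to IoU, whereas DIoU remains effective.
DIoU can also replace plain IoU as the overlap criterion in NMS, which makes the NMS result more reasonable and effective.

Disadvantages:
On top of the overlap measure, DIoU accounts for the distance between the boxes, but it still cannot represent how similar their aspect ratios are.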
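
A scalar sketch of DIoU makes the difference from GIoU visible on the containment case. It follows the definition IoU minus squared center distance over squared enclosing diagonal, with boxes assumed to be (x1, y1, x2, y2) corners; the example boxes are made up.

```python
def diou(a, b):
    """DIoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Squared distance between the two box centers
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest enclosing rectangle
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return inter / union - rho2 / c2


# Same IoU (0.25) in both cases, but DIoU prefers the centered inner box
corner = diou((0, 0, 4, 4), (0, 0, 2, 2))
center = diou((0, 0, 4, 4), (1, 1, 3, 3))
print(corner, center)
```

Both inner boxes have IoU 0.25 with the outer box, yet the centered one scores higher, so the loss still pushes the corner box toward the center.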

How CIoU works

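Formula:
CIoU = IoU - ρ²(b, b_gt) / c² - αv

Here α is a weighting function and v measures the similarity of the aspect ratios. They are computed as:
v = (4 / π²) * (arctan(w_gt / h_gt) - arctan(w / h))²
α = v / ((1 - IoU) + v)

Of the three key factors in bounding-box regression (overlap area, center distance, aspect ratio), the aspect ratio had not yet been taken into account. CIoU was therefore proposed on top of DIoU to also consider the aspect ratios of the two rectangles, that is, the similarity of their shapes. In other words, CIoU adds an aspect-ratio penalty term to DIoU.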

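Advantages:
A more accurate similarity measure: besides the overlap, CIoU considers the center distance, the diagonal of the enclosing box, and the aspect ratio, so it measures the similarity between two bounding boxes more accurately, especially when target shapes and sizes are irregular.
Better robustness: compared with plain IoU, CIoU is more robust to changes in target shape and size and adapts better to detection tasks with objects of varied sizes and shapes.

Disadvantages:
Higher computational cost: CIoU adds the center-distance, diagonal, and aspect-ratio computations, so it costs somewhat more to compute than plain IoU.
Harder to implement: the computation is more complex and requires more processing of the box coordinates, so it is somewhat more difficult to implement correctly.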
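
To see the aspect-ratio penalty in isolation, here is a scalar sketch of CIoU following the usual definition (IoU, minus the DIoU distance term, minus αv). It is an illustration on made-up boxes given as (x1, y1, x2, y2) corners, not the batched tensor version shown earlier in section 4.3.

```python
import math


def ciou(pred, gt):
    """CIoU of two boxes given as (x1, y1, x2, y2)."""
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)
    # Distance term: squared center distance over squared enclosing diagonal
    rho2 = ((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2 / 4 \
         + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2 / 4
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
       + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # Aspect-ratio term v and its weight alpha
    v = 4 / math.pi ** 2 * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v


gt = (0, 0, 4, 4)
square = ciou((1, 1, 3, 3), gt)      # same center, same aspect ratio as gt
wide = ciou((0, 1.5, 4, 2.5), gt)    # same center and IoU, but a 4:1 shape
print(square, wide)
```

Both predictions share the center and the IoU of 0.25 with the ground truth, so IoU and DIoU cannot separate them; only the aspect-ratio term makes the wide box score lower.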

Recall and precision

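Evaluating detections on a dataset produces four kinds of results:
TP, TN, FP, FN:

T (true): the prediction is correct

F (false): the prediction is wrong

P (positive): treated as a positive example

N (negative): treated as a negative example

My understanding: the second letter is the predicted result, and the first letter says whether that prediction is correct. For example:

TP: predicted P (positive) and the prediction is correct; a positive sample detected as positive (true positive).
TN: predicted N (negative) and the prediction is correct; a negative sample detected as negative (true negative).
FP: predicted P (positive) and the prediction is wrong; a negative sample detected as positive (false positive).
FN: predicted N (negative) and the prediction is wrong; a positive sample detected as negative (false negative).

TP+FP+TN+FN: total number of samples.
TP+FN: number of actual positive samples.
TP+FP: number of samples predicted as positive, both correct and incorrect.
FP+TN: number of actual negative samples.
TN+FN: number of samples predicted as negative, both correct and incorrect.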

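Recall:
The fraction of the actual positive examples that are predicted correctly (how completely they are found).
Use: evaluates how much of the set of objects to be detected the detector covers.

Out of all positive examples in the dataset (TP+FN), recall is the proportion that the model correctly identifies as positive (TP). FN counts the examples that are actually positive but that the model mistakes for negative. Recall is also called sensitivity. In object detection, the objects in an image are usually treated as the positive examples, so high recall means the model finds more of the objects in the image.
Recall = TP / (TP + FN)
Precision:
The fraction of the samples predicted as positive that are truly positive (how correct the findings are).

Use: evaluates how accurate the detector's reported detections are.

Out of all examples the model judges to be positive (TP+FP), precision is the proportion that are true positives (TP). In object detection, high precision means that most of what the model detects really are objects, with only a few non-objects reported as objects.
Precision = TP / (TP + FP)
In other words, recall is the share of the dataset's relevant items that appear in the detection results, while precision is the share of the detection results that are relevant. Ideally both would be as high as possible, but in some situations the two metrics conflict with each other.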
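
The two formulas and the tension between them can be illustrated with made-up counts. `precision_recall` is a hypothetical helper, and the scenario (10 objects in the images, a strict and a loose confidence threshold) is an assumption for illustration only.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 10 real objects. Strict threshold: 8 detections, 6 of them correct
strict = precision_recall(tp=6, fp=2, fn=4)
# Loose threshold: 15 detections, 9 of them correct
loose = precision_recall(tp=9, fp=6, fn=1)
print(strict, loose)
```

Lowering the threshold finds more of the objects (recall rises from 0.6 to 0.9) but admits more false detections (precision falls from 0.75 to 0.6), which is the conflict mentioned above.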

softmax

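The softmax function is an activation function commonly used in multi-class classification. It maps a set of values to a probability distribution: every output lies between 0 and 1 and all outputs sum to 1.

Specifically, given an n-dimensional vector x = (x1, x2, …, xn), softmax converts it into an n-dimensional probability vector p = (p1, p2, …, pn), where each component pi is the probability that the sample belongs to class i. The formula is:

p_i = exp(x_i) / Σ_j exp(x_j)

where exp(·) is the natural exponential function.

Softmax has several important properties:

Non-negativity: pi ≥ 0 for every i (in fact pi > 0), because the exponential function is always positive.
Normalization: Σ_i p_i = 1, that is, all probabilities sum to 1. This ensures that the softmax output can be interpreted as a probability distribution.
Monotonicity: if x_i > x_j then p_i > p_j, so softmax preserves the ordering of its inputs.
Softmax is widely used in the output layer of deep learning models to turn the model's outputs into a probability distribution for multi-class classification. It can also be used in other machine learning and statistical models.

In short, softmax is a very useful activation function that plays an important role in multi-class classification problems.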

"""
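Suppose we have an image classification task whose goal is to classify an input image as a cat, a dog, or a bird. We can solve it with a neural network whose output layer uses the softmax function.

Assume the last layer of the network outputs the 3-dimensional vector x = [2.1, 1.5, 0.8]. It expresses the "relative likelihood" the network assigns to each of the three classes.

We then feed this vector into the softmax function and obtain the result below: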
"""
import numpy as np

x = np.array([2.1, 1.5, 0.8])
p = np.exp(x) / np.sum(np.exp(x))

print(p)
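
The direct formula above can overflow when the inputs are large, because exp grows very quickly. A common remedy, shown here as a sketch rather than as part of the original example, is to subtract the maximum before exponentiating; this leaves the result unchanged because the common factor cancels in the ratio. The large input vector is made up for the demonstration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax for a 1-D vector."""
    x = np.asarray(x, dtype=float)
    z = x - np.max(x)      # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()


p = softmax([2.1, 1.5, 0.8])              # same vector as in the example above
big = softmax([1000.0, 999.0, 998.0])     # would overflow without the shift
print(p, p.sum())
print(big)
```

The output keeps the three properties listed earlier: every entry is positive, the entries sum to 1, and the ordering of the inputs is preserved.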