Trying to Understand YOLOX

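For feature extraction, the backbone uses the Focus structure: it samples the input at every other pixel and stacks the resulting slices along the channel dimension.

It also adopts the CSPNet structure: while residual blocks are stacked on the main path, a large residual branch is built alongside them that receives only light processing and connects directly to the end.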

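Earlier YOLO versions implemented classification and regression in a single 1*1 convolution. YOLOX argues that this hurts detection, so in YOLOX the YOLOHead is split into two branches that are computed separately and combined at prediction time.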
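To make the split concrete, here is a minimal sketch of a decoupled head for a single feature level. The class name `TinyDecoupledHead`, the layer widths and the number of convolutions are assumptions chosen for illustration; this is not the official YOLOX head.

```python
import torch
import torch.nn as nn


class TinyDecoupledHead(nn.Module):
    """Illustrative decoupled head for one feature level (hypothetical, simplified)."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        # Classification branch: its own conv stack, ending in num_classes logits.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),
        )
        # Regression branch: one shared conv, then separate box (4) and objectness (1) outputs.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.box_pred = nn.Conv2d(in_channels, 4, 1)
        self.obj_pred = nn.Conv2d(in_channels, 1, 1)

    def forward(self, feat):
        cls_out = self.cls_branch(feat)
        reg_feat = self.reg_branch(feat)
        # Combine the branches at the end: [box(4), objectness(1), classes].
        return torch.cat([self.box_pred(reg_feat), self.obj_pred(reg_feat), cls_out], dim=1)


head = TinyDecoupledHead(in_channels=64, num_classes=3)
out = head(torch.randn(2, 64, 20, 20))
print(out.shape)  # torch.Size([2, 8, 20, 20])
```

Each spatial position ends up with 4 + 1 + 3 = 8 values, but the class scores and the box/objectness values come from different convolution stacks.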

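For data augmentation, YOLOX uses Mosaic, which stitches four images into a single training image. Its advantage is that it enriches the backgrounds against which objects are detected.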
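The stitching step can be sketched as follows. This is a deliberately simplified version: the function name `simple_mosaic`, the canvas size and the crop strategy are assumptions, and a real pipeline would also rescale the images and shift and clip the bounding boxes, which is omitted here.

```python
import numpy as np


def simple_mosaic(images, out_size=64, seed=0):
    """Stitch four HxWx3 uint8 images around a random centre point (simplified)."""
    assert len(images) == 4
    rng = np.random.default_rng(seed)
    # Random split point, kept away from the borders.
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # (y0, y1, x0, x1) of the four regions: top-left, top-right, bottom-left, bottom-right.
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        # Crop each source image to the region size (sources are assumed large enough).
        canvas[y0:y1, x0:x1] = img[: y1 - y0, : x1 - x0]
    return canvas, (cx, cy)


imgs = [np.full((64, 64, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
mosaic, (cx, cy) = simple_mosaic(imgs)
print(mosaic.shape, mosaic[0, 0, 0], mosaic[-1, -1, 0])
```

Each corner of the output comes from a different source image, so a single training sample mixes four backgrounds.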

Anchor Free: a basic anchor-based detector needs to cluster the prior boxes, which costs a lot of time. YOLOX is anchor-free, so the decoding logic is simpler and much more readable.
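As an illustration of how simple anchor-free decoding can be, the sketch below decodes one raw prediction at one grid cell. The function name and argument layout are assumptions; the formula (cell index plus offset, times stride, and an exponential for width and height) follows the usual anchor-free scheme rather than being copied from the official code.

```python
import math


def decode_anchor_free(pred, grid_x, grid_y, stride):
    """Decode one raw prediction (tx, ty, tw, th) at grid cell (grid_x, grid_y)."""
    tx, ty, tw, th = pred
    cx = (grid_x + tx) * stride   # centre = cell index + predicted offset, in pixels
    cy = (grid_y + ty) * stride
    w = math.exp(tw) * stride     # size = exp(log-scale prediction) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h


print(decode_anchor_free((0.5, 0.25, 0.0, 0.0), grid_x=3, grid_y=4, stride=8))
# (28.0, 34.0, 8.0, 8.0)
```

No prior box width or height appears anywhere, which is why no clustering step is needed.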

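SimOTA dynamic positive-sample matching: based on the overlap between each ground-truth box and the boxes predicted at the current feature points, compute a value k for each ground-truth box, meaning that k feature points are assigned to it. Then compute a cost matrix from how accurate each feature point's prediction is for the ground-truth box and whether the point lies inside it. The k feature points with the lowest cost become the positive samples for that ground-truth box.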
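The two steps (derive k from the IoUs, then take the k cheapest points) can be sketched for a single ground-truth box as below. The function name, the `top_candidates=10` default and the toy numbers are assumptions for illustration; the sketch also leaves out how the cost itself is built and how a feature point claimed by several ground-truth boxes is resolved.

```python
def dynamic_k_matching(ious, costs, top_candidates=10):
    """Pick positive feature points for ONE ground-truth box.

    ious[i]  : IoU between the GT box and the box predicted at feature point i
    costs[i] : matching cost of feature point i for this GT (lower is better)
    """
    # Dynamic k: sum of the largest IoUs, truncated to an int, at least 1.
    top_ious = sorted(ious, reverse=True)[:top_candidates]
    k = max(int(sum(top_ious)), 1)
    # The k feature points with the lowest cost become positives.
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    return k, sorted(order[:k])


ious = [0.9, 0.8, 0.7, 0.1, 0.05]
costs = [0.3, 0.2, 0.9, 1.5, 2.0]
print(dynamic_k_matching(ious, costs))  # (2, [0, 1])
```

Here the IoUs sum to about 2.55, so k = 2, and the two lowest-cost points (indices 1 and 0) are selected. A box that is predicted well by many points therefore gets more positives than one that is predicted poorly.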

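Now for the most important part, the network structure:

Look at the YOLOHead: prediction is split into two parts. The left part predicts the object class; the right part predicts whether an object is present and the regression coefficients of the feature point (the predicted box coordinates).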

YOLOX uses the SiLU activation function, SiLU(x) = x * sigmoid(x). It combines ideas from ReLU and Sigmoid and can be seen as a smooth version of ReLU.
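To see what "a smooth ReLU" means numerically, the following small sketch evaluates both functions at a few points using only the standard library. The helper names `relu` and `silu` are defined here for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    # SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  silu={silu(x):.4f}")
```

For large positive inputs SiLU approaches ReLU, while for negative inputs it dips slightly below zero instead of being cut off sharply at zero.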

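YOLOX uses the SPP structure, which extracts features with max pooling at several kernel sizes to enlarge the receptive field. In YOLOv4 the SPP structure is used in the FPN; in YOLOX it is used in the backbone feature-extraction network.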

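(A quick note on SPP:

A traditional CNN requires a fixed input size, so in both training and inference every input image must be resized to the same dimensions, and this resizing inevitably loses some information.

An SPP network accepts images of different sizes. The core idea is to pool at several levels so that features are extracted at different scales.

A quick note on FPN:

FPN builds a feature pyramid to make effective use of multi-scale features. It consists of a backbone network, bottom-up feature extraction, and top-down feature fusion.

)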
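The SPP note above can be illustrated with a short sketch in which inputs of different spatial sizes are pooled into a vector of the same length. The class name `FixedLengthSPP` and the pyramid levels (1, 2, 4) are assumptions for this example.

```python
import torch
import torch.nn as nn


class FixedLengthSPP(nn.Module):
    """Pool a feature map of any spatial size into a fixed-length vector."""

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive max-pool per pyramid level; level n yields an n x n grid.
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(n) for n in levels])

    def forward(self, x):
        # Flatten each level to (batch, channels * n * n) and concatenate.
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)


spp = FixedLengthSPP()
a = spp(torch.randn(1, 8, 32, 32))
b = spp(torch.randn(1, 8, 45, 57))
print(a.shape, b.shape)  # both torch.Size([1, 168])
```

With 8 channels and levels 1, 2 and 4, the output length is 8 * (1 + 4 + 16) = 168 regardless of input size. The SPPBottleneck class in the YOLOX code further down is a different variant: it uses fixed pooling kernels with stride 1, so the spatial size is preserved rather than collapsed.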

A point for improvement: SiLU tends to outperform ReLU in deep models, so when ReLU appears in a multi-layer network it is worth trying SiLU in its place.
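One way to try this swap on an existing model is sketched below. The helper `replace_relu_with_silu` is a hypothetical utility written for this note, and the toy network exists only to demonstrate it; whether the swap actually helps still has to be checked by retraining.

```python
import torch
import torch.nn as nn


def replace_relu_with_silu(module):
    """Recursively swap every nn.ReLU child for nn.SiLU, in place."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.SiLU())
        else:
            replace_relu_with_silu(child)
    return module


net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU()),
)
replace_relu_with_silu(net)
print(net)
```

Neither activation has parameters, so the existing convolution weights are left untouched by the replacement.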

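In summary: in the head, class confidence and position used to be predicted by one shared convolution layer; they are now predicted by separate convolution layers. The network is anchor-free. The loss is (classification loss + objectness loss + 5 * localization loss) divided by the number of anchor points assigned as positive samples. Positive and negative samples are matched with SimOTA. My understanding: for each GT, compute the cost and the IoU against all anchor points; the IoU determines how many positive samples to select, and the cost determines which ones.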
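The loss combination described above amounts to the following arithmetic. The function name and the guard against zero positives are assumptions added for this sketch; the inputs are taken to be loss terms already summed over the batch.

```python
def yolox_total_loss(cls_loss_sum, obj_loss_sum, reg_loss_sum, num_pos, reg_weight=5.0):
    """Combine summed loss terms: (cls + obj + reg_weight * reg) / num_pos."""
    num_pos = max(num_pos, 1)  # assumption: avoid dividing by zero when nothing is positive
    return (cls_loss_sum + obj_loss_sum + reg_weight * reg_loss_sum) / num_pos


print(yolox_total_loss(12.0, 20.0, 4.0, num_pos=8))  # 6.5
```

With these toy numbers the result is (12 + 20 + 5 * 4) / 8 = 6.5.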

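What follows walks through the official YOLOX code (only the models part and the loss part).

First, the attention part, taking CAM as an example:

Start by defining the Mish activation function. It is a relatively recent activation that, like SiLU, can be seen as a smooth version of ReLU. It is reported to outperform ReLU and Swish in many networks, so it is worth trying when making improvements.

Next is the CAM attention module. The input is first average-pooled along the height and along the width, and the two results are concatenated. A 1*1 convolution reduces the channel count (to cut computation), and the result is passed through Mish. It is then split back into the two parts, each of which goes through a 1*1 convolution and a sigmoid. The results are expanded to the shape of x and multiplied element-wise with x to give the output.

Here is the code: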

import torch
import torch.nn as nn
import torch.nn.functional as F
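class Mish(nn.Module):
    """Mish activation: Mish(x) = x * tanh(ln(1 + exp(x))).

    A relatively recent activation function. It is smooth, but slower to
    compute than ReLU.
    YOLOv4 uses Mish in its backbone in place of ReLU-style activations.
    It is often reported to outperform ReLU and Swish, so it is worth
    trying as a replacement.
    """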
    def __init__(self):
        super().__init__()
        print("Mish activation loaded...")
    def forward(self,x):
        x = x * (torch.tanh(F.softplus(x)))
        return x
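class CAM(nn.Module):
    """Attention module that pools along height and width separately.

    The input feature map is average-pooled along each spatial axis and the
    two results are concatenated. A 1*1 convolution then reduces the channel
    count and Mish is applied. The result is split back into its height and
    width parts; each part goes through a 1*1 convolution that restores the
    channel count, followed by a sigmoid. Finally s_h and s_w are expanded
    to the shape of x and multiplied element-wise with x.
    """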
    def __init__(self, channels, reduction=32):
        super(CAM, self).__init__()
        self.conv_1x1 = nn.Conv2d(in_channels=channels, out_channels=channels // reduction, kernel_size=1, stride=1,
                                  bias=False)                        
        self.mish = Mish()
        self.bn = nn.BatchNorm2d(channels // reduction)  # NOTE: defined here but never applied in forward()
        self.F_h = nn.Conv2d(in_channels=channels // reduction, out_channels=channels, kernel_size=1, stride=1, bias=False)
        self.F_w = nn.Conv2d(in_channels=channels // reduction, out_channels=channels, kernel_size=1, stride=1, bias=False)  
        self.sigmoid_h = nn.Sigmoid() 
        self.sigmoid_w = nn.Sigmoid()
 
    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        # AdaptiveAvgPool2d derives the pooling window and stride from the input size,
        # so the outputs are exactly (h, 1) and (1, w)
        avg_pool_x = nn.AdaptiveAvgPool2d((h, 1))
        avg_pool_y = nn.AdaptiveAvgPool2d((1, w))
        x_h = avg_pool_x(x).permute(0, 1, 3, 2) 
        x_w = avg_pool_y(x)  
        # After concatenation the shape is (batch_size, channels, 1, h + w); the 1*1 conv then reduces the channels
        x_cat_conv_relu = self.mish(self.conv_1x1(torch.cat((x_h, x_w), 3)))
        x_cat_conv_split_h, x_cat_conv_split_w = x_cat_conv_relu.split([h, w], 3)
        s_h = self.sigmoid_h(self.F_h(x_cat_conv_split_h.permute(0, 1, 3, 2)))
        s_w = self.sigmoid_w(self.F_w(x_cat_conv_split_w))
        out = x * s_h.expand_as(x) * s_w.expand_as(x)
        return out

The graph of the SiLU activation function is as follows:

The source code of some other layers, with comments added, is as follows:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Copyright (c) Megvii Inc. All rights reserved.

import torch
import torch.nn as nn


class SiLU(nn.Module):
    """Export-friendly version of nn.SiLU().

    SiLU is the same function as Swish: SiLU(x) = x * sigmoid(x).
    """

    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)


def get_activation(name="silu", inplace=True):
    if name == "silu":
        module = nn.SiLU(inplace=inplace)
    elif name == "relu":
        module = nn.ReLU(inplace=inplace)
    elif name == "lrelu":
        module = nn.LeakyReLU(0.1, inplace=inplace)
    else:
        raise AttributeError("Unsupported act type: {}".format(name))
    return module


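class BaseConv(nn.Module):
    """A Conv2d -> Batchnorm -> silu/leaky relu block

    `groups` splits the input channels into that many groups; each group is
    convolved separately and the results are concatenated.
    forward() is simply convolution + batch normalization + activation.
    fuseforward() skips batch normalization and applies the activation
    directly to the convolution output; it is meant for use after the
    BatchNorm layer has been fused into the convolution.
    """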

    def __init__(
        self, in_channels, out_channels, ksize, stride, groups=1, bias=False, act="silu"
    ):
        super().__init__()
        # same padding
        pad = (ksize - 1) // 2
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=ksize,
            stride=stride,
            padding=pad,
            groups=groups,
            bias=bias,
        )
        # Batch normalization over the out_channels output channels
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = get_activation(act, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))


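class DWConv(nn.Module):
    """Depthwise Conv + Conv

    First a depthwise convolution + batch norm + activation
    (groups=in_channels, channel count unchanged), then a 1*1 pointwise
    convolution + batch norm + activation that changes the channel count.
    In effect, two BaseConv blocks in sequence.
    """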

    def __init__(self, in_channels, out_channels, ksize, stride=1, act="silu"):
        super().__init__()
        self.dconv = BaseConv(
            in_channels,
            in_channels,
            ksize=ksize,
            stride=stride,
            groups=in_channels,
            act=act,
        )
        self.pconv = BaseConv(
            in_channels, out_channels, ksize=1, stride=1, groups=1, act=act
        )

    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)


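class Bottleneck(nn.Module):
    """Bottleneck block with an optional residual connection.

    A 1*1 BaseConv first maps the input to hidden_channels, which reduces
    computation. A 3*3 convolution follows; the `depthwise` argument selects
    DWConv or BaseConv for it.
    The `shortcut` argument controls whether a residual connection is added
    (only when in_channels == out_channels).
    """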
    # Standard bottleneck
    def __init__(
        self,
        in_channels,
        out_channels,
        shortcut=True,
        expansion=0.5,
        depthwise=False,
        act="silu",
    ):
        super().__init__()
        hidden_channels = int(out_channels * expansion)
        # Choose DWConv (two BaseConv blocks) or a plain BaseConv based on `depthwise`
        Conv = DWConv if depthwise else BaseConv
        # BaseConv uses "same" padding, so the spatial size of the feature map is unchanged
        self.conv1 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        self.conv2 = Conv(hidden_channels, out_channels, 3, stride=1, act=act)
        self.use_add = shortcut and in_channels == out_channels

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        if self.use_add:
            y = y + x
        return y


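class ResLayer(nn.Module):
    """Residual layer with `in_channels` inputs.

    The input passes through two convolution layers (a 1*1 layer that halves
    the channels to cut computation, then a 3*3 layer that restores them),
    and the result is added to the original input.
    """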

    def __init__(self, in_channels: int):
        super().__init__()
        mid_channels = in_channels // 2
        self.layer1 = BaseConv(
            in_channels, mid_channels, ksize=1, stride=1, act="lrelu"
        )
        self.layer2 = BaseConv(
            mid_channels, in_channels, ksize=3, stride=1, act="lrelu"
        )

    def forward(self, x):
        out = self.layer2(self.layer1(x))
        return x + out


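class SPPBottleneck(nn.Module):
    """Spatial pyramid pooling layer used in YOLOv3-SPP

    The input first passes through a 1*1 convolution. Three max-pooling
    layers with kernel sizes 5, 9 and 13 are then applied to that result.
    The three pooled outputs are concatenated with the un-pooled convolution
    output and fed to a second 1*1 convolution.
    In short: two convolution layers around several parallel pooling layers.
    """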

    def __init__(
        self, in_channels, out_channels, kernel_sizes=(5, 9, 13), activation="silu"
    ):
        super().__init__()
        hidden_channels = in_channels // 2
        self.conv1 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=activation)
        self.m = nn.ModuleList(
            [
                # padding=ks // 2 with stride 1 keeps the output the same spatial size as the input
                nn.MaxPool2d(kernel_size=ks, stride=1, padding=ks // 2)
                for ks in kernel_sizes
            ]
        )
        conv2_channels = hidden_channels * (len(kernel_sizes) + 1)
        self.conv2 = BaseConv(conv2_channels, out_channels, 1, stride=1, act=activation)

    def forward(self, x):
        x = self.conv1(x)
        x = torch.cat([x] + [m(x) for m in self.m], dim=1)
        x = self.conv2(x)
        return x


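class CSPLayer(nn.Module):
    """C3 in yolov5, CSP Bottleneck with 3 convolutions

    The input goes through two separate 1*1 BaseConv layers, giving x_1 and
    x_2. x_1 then passes through n Bottleneck blocks, is concatenated with
    x_2 along the channel axis, and the result goes through a third 1*1
    convolution.
    """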
    def __init__(
        self,
        in_channels,
        out_channels,
        n=1,
        shortcut=True,
        expansion=0.5,
        depthwise=False,
        act="silu",
    ):
        """
        Args:
            in_channels (int): input channels.
            out_channels (int): output channels.
            n (int): number of Bottlenecks. Default value: 1.
        """
        # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        hidden_channels = int(out_channels * expansion)  # hidden channels
        self.conv1 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        self.conv2 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        self.conv3 = BaseConv(2 * hidden_channels, out_channels, 1, stride=1, act=act)
        module_list = [  # a stack of n Bottleneck blocks (residual when shortcut=True)
            Bottleneck(
                hidden_channels, hidden_channels, shortcut, 1.0, depthwise, act=act
            )
            for _ in range(n)
        ]
        # nn.Sequential is a container that chains the sub-modules in order, which keeps forward() simple
        self.m = nn.Sequential(*module_list)

    def forward(self, x):
        x_1 = self.conv1(x)
        x_2 = self.conv2(x)
        x_1 = self.m(x_1)
        x = torch.cat((x_1, x_2), dim=1)
        return self.conv3(x)


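class Focus(nn.Module):
    """Focus width and height information into channel space.

    The image is sliced at every other pixel into four sub-images, which are
    concatenated along the channel axis and passed through a convolution.
    Height and width are halved (1/4 of the spatial area) and the channel
    count is multiplied by four before the convolution.
    """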
    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        """
        Slicing: in the first slice the leading dimensions are left alone and
        every second element is taken along each of the last two dimensions,
        starting at index 0 (the default).
        A start of 1, as in `1::2`, takes every second element starting at index 1.
        """
        # shape of x (b,c,w,h) -> y(b,4c,w/2,h/2)
        patch_top_left = x[..., ::2, ::2]
        patch_top_right = x[..., ::2, 1::2]
        patch_bot_left = x[..., 1::2, ::2]
        patch_bot_right = x[..., 1::2, 1::2]
        x = torch.cat(
            (
                patch_top_left,
                patch_bot_left,
                patch_top_right,
                patch_bot_right,
            ),
            dim=1,
        )
        return self.conv(x)
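
As a quick check of the slicing used in Focus.forward above, the sketch below applies the same four slices to a tiny 4*4 tensor, without the final convolution. The toy input is an assumption chosen so that the result is easy to read.

```python
import torch

x = torch.arange(16.0).reshape(1, 1, 4, 4)  # (batch, channels, height, width)
# Same order as in Focus.forward: top-left, bottom-left, top-right, bottom-right.
slices = [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]]
y = torch.cat(slices, dim=1)
print(y.shape)  # torch.Size([1, 4, 2, 2])
print(y[0, 0])  # even rows, even columns: values 0, 2, 8, 10
```

One channel of size 4*4 becomes four channels of size 2*2, so no pixel is discarded; spatial detail is moved into the channel dimension.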