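This article documents how I implemented YOLO v1. Along the way I consulted a lot of open-source code and blog posts; many thanks to their authors.


YOLO v1 code references (a scholar never "copies", only borrows <_<):
	1. Usage of torch.nonzero
	2. Usage of torch.clamp
	3. Usage of the numel function
	4. Usage of squeeze and unsqueeze
	5. Usage of torch.sort
	6. Understanding requires_grad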



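    • Implementing YOLOv1 with PyTorch (a very long read)
      • 0. How to use this code
      • 1. Preparation
        • 1.1 Directory structure
        • 1.2 About the dataset
        • 1.3 About YOLOv1
        • 1.4 Notes
      • 2. Backbone implementation
        • 2.1 ResNet architecture
        • 2.2 Importing the required packages
        • 2.3 Building ResNet
        • 2.4 Complete code
      • 3. Dataset processing
      • 4. Data loader
        • 4.1 Imports
        • 4.2 Building the data loader
        • 4.3 Debugging code
        • 4.4 Complete code
      • 5. Loss function
        • 5.1 Imports
        • 5.2 Building the loss function
        • 5.3 Complete code
      • 6. Training
        • 6.1 Imports
        • 6.2 Implementing the training code
        • 6.3 Training process
        • 6.4 Complete code
      • 7. Testing
        • 7.1 Imports
        • 7.2 Implementing the prediction code
        • 7.3 Prediction results
        • 7.4 Complete code
      • 8. Download links for code and weights
      • 9. Summary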

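0. How to use this code


First, download the code and the weight file from the address at the end of this article, then download the dataset through the link in the blog post mentioned in Section 1.2. Once everything is downloaded, the result looks like the figure below:



The folders have the following meanings:

network: where the CNN backbone is stored

A note on the weight file: I trained on a single machine with a single GPU, batch_size=2 and epoch=50, with default parameters, and it took a full day. I suspect the optimizer and learning rate were not well chosen, so the result is not ideal. In the test recording in Section 7.3 you can see that some objects are missed and some are detected incorrectly.

Once everything is ready, open predict.py, scroll to the last few lines, and make the following changes:


After choosing the dataset, run predict.py directly and you will see the prediction results below: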

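1. Preparation


1.1 Directory structure

My directory structure is shown in the figure below:


Note that my dataset folder sits one directory level up. You can place it in the current directory instead and then change the paths in the code.

1.2 About the dataset

This project again uses the VOC dataset. For an introduction to VOC, see my earlier blog post on it.

1.3 About YOLOv1

There are plenty of YOLOv1 paper walkthroughs online. You can read mine, but I recommend the experts' write-ups more:

[Paper walkthrough] The YOLO trilogy explained: Yolov1

1.4 Notes

  • For the important parts of the code I explain things step by step; for the less important parts, the comments in the code should be enough.
  • If you just want to use the code quickly, jump to the end of the article (it includes the download paths for the code and the weight file).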

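2. Backbone implementation


This file is named My_ResNet.py and lives in the network folder.

2.1 ResNet architecture


First, here is the ResNet architecture diagram:



We will implement ResNet based on the table and figure above.

2.2 Importing the required packages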

# 1. Import the required packages
import torch
import math
from torch import nn
import torch.utils.model_zoo as model_zoo  # loads a model from a URL
import torch.nn.functional as F

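2.3 Building ResNet


The basic block resembles the architecture in the figure below:


Note, however, that ResNet18 through ResNet152 use two kinds of block, as shown below: one without 1*1 convolutions and one with them:


Accordingly, the implementation has two block classes: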

# 2. Build the block without 1*1 convolutions
class Base_Block(nn.Module):
    # Expansion factor: how many times the output channels are widened
    expansion = 1
    def __init__(self,in_planes,out_planes,stride=1,downsample=None):
        """
        :param in_planes: number of input channels
        :param out_planes: number of output channels
        :param stride: stride, default 1
        :param downsample: module applied to the shortcut when downsampling is needed, else None
        """
        super(Base_Block, self).__init__()
        # Define the layers
        self.conv1 = nn.Conv2d(in_planes,out_planes,kernel_size=3,stride=stride,padding=1,bias=False)
        self.bn1 = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor instead of allocating a new one
        self.conv2 = nn.Conv2d(out_planes,out_planes,kernel_size=3,stride=1,padding=1,bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self,x):
        # Keep the input for the shortcut (residual) branch
        res = x
        # Main branch
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        # Downsample the shortcut if required
        if self.downsample is not None:
            res = self.downsample(res)
        # Add the shortcut
        out += res
        # Final activation
        out = self.relu(out)
        return out

# 3. Build the block with 1*1 convolutions
class Senior_Block(nn.Module):
    expansion = 4
    def __init__(self,in_planes,planes,stride=1,downsample=None):
        """
        :param in_planes: number of input channels
        :param planes: number of intermediate channels; the final output has planes * expansion channels
        :param stride: stride
        :param downsample: module applied to the shortcut when downsampling is needed, else None
        """
        super(Senior_Block, self).__init__()
        self.conv1 = nn.Conv2d(in_planes,planes,kernel_size=1,bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes,planes,kernel_size=3,stride=stride,padding=1,bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes*4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self,x):
        # Keep the input for the shortcut (residual) branch
        res = x
        # Main branch
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        # Downsample the shortcut if required
        if self.downsample is not None:
            res = self.downsample(x)
        # Add the shortcut
        out += res
        out = self.relu(out)
        return out
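
To make the role of expansion concrete, here is a small sketch that computes the output channel count of each of the four stages for both block types. The helper name stage_channels is made up for illustration; the base widths 64, 128, 256 and 512 are the ones passed to _make_layer in the ResNet class of this article.

```python
# Base widths of the four ResNet stages (layer1..layer4)
BASE_PLANES = [64, 128, 256, 512]

def stage_channels(expansion):
    """Output channels of each stage for a block with the given expansion."""
    return [planes * expansion for planes in BASE_PLANES]

basic = stage_channels(1)        # Base_Block, used by ResNet18/34
bottleneck = stage_channels(4)   # Senior_Block, used by ResNet50/101/152
print(basic)
print(bottleneck)
```

The last value in the bottleneck case, 2048, is the channel count that the output block receives when the backbone is ResNet50 or deeper; with Base_Block it would be 512.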


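One point deserves special attention: since we are implementing YOLOv1, the final output must have shape 7*7*30 rather than the usual classification output. So we need to remove the final fully-connected layer of the original ResNet and change the output format. I debugged this with my earlier ResNet classification code and found that, just before the final fully-connected layer, the output shape was:


So we only need to modify the end of ResNet: following the form of a ResNet block, we add a dedicated output_block. The idea is shown in the figure below: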


The concrete implementation is as follows:

# 4. Build the output layer
class Output_Block(nn.Module):
    expansion = 1
    def __init__(self, in_planes, planes, stride=1, block_type='A'):
        """
        :param in_planes: number of input channels
        :param planes: number of intermediate channels
        :param stride: stride
        :param block_type: block type; 'A' means no projection on the shortcut, 'B' means one is required
        """
        super(Output_Block, self).__init__()
        # Define the convolutions
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=2, bias=False,dilation=2)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)
        # Decide whether the shortcut needs a projection; compared with the
        # usual check, the block type is considered as well
        self.downsample = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes or block_type=='B':
            self.downsample = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                # BatchNorm after the projection, matching the shortcut built in _make_layer
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        # Main branch
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        # Add the (possibly projected) shortcut
        out += self.downsample(x)
        out = F.relu(out)
        return out
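
A quick check of why padding=2 goes together with dilation=2 in conv2: with the standard convolution output-size formula (the one given in the PyTorch Conv2d documentation), a 3*3 kernel with dilation 2 covers a 5*5 window, so a padding of 2 keeps the spatial size unchanged at stride 1. This is only a sketch; the function name conv_out_size and the 14*14 input are chosen for the example.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

# Same settings as conv2 in Output_Block: the size is preserved
same = conv_out_size(14, kernel=3, stride=1, padding=2, dilation=2)
# With padding=1 the dilated kernel would shrink the feature map
shrunk = conv_out_size(14, kernel=3, stride=1, padding=1, dilation=2)
print(same, shrunk)
```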


Next we implement the ResNet class. Its overall structure follows the table at the beginning; we just build it step by step.

The code is as follows:

# 5. Build ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers):
        """
        :param block: the basic block class
        :param layers: number of blocks in each stage, which determines the ResNet variant; e.g. ResNet50 passes [3, 4, 6, 3]
        """
        super(ResNet, self).__init__()
        # Initial channel count: 64
        self.inplanes = 64
        # Stem convolution and pooling shared by every variant
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Define the four stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # Don't forget the output_block defined above. Its input is the output of
        # layer4: 512 channels for Base_Block, 2048 for Senior_Block
        self.layer5 = self._make_out_layer(in_channels=512 * block.expansion)
        # Final layers: pool, then map to 30 channels so the output is 7*7*30
        self.avgpool = nn.AvgPool2d(2)  # kernel_size = 2  , stride = 2
        self.conv_end = nn.Conv2d(256, 30, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn_end = nn.BatchNorm2d(30)
        # Parameter initialisation
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    # Build one stage from the block class and the number of blocks
    def _make_layer(self, block, planes, blocks, stride=1):
        """
        :param block: the block class
        :param planes: base channel count of the stage (the output has planes * block.expansion channels)
        :param blocks: how many identical blocks to stack
        :param stride: stride
        """
        # Initialise the shortcut projection
        downsample = None
        # A projection is needed when the stride is not 1 or when the
        # input and output channel counts differ
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                # The projection exists so that the shortcut and the main branch can be added
                nn.Conv2d(self.inplanes, planes * block.expansion,kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        # Start building
        layers = []
        # The first block needs special handling:
        # e.g. if it receives 512 channels but outputs 256, the shortcut must be projected,
        # whereas the second block receives the first block's 256 channels and also
        # outputs 256, so no projection is needed
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            # Repeat the remaining blocks
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)  # * unpacks the list into separate arguments

    # Build the output layer
    def _make_out_layer(self, in_channels):
        layers = []
        # Stack output blocks in the same way as a regular stage
        layers.append(Output_Block(in_planes=in_channels, planes=256, block_type='B'))
        layers.append(Output_Block(in_planes=256, planes=256, block_type='A'))
        layers.append(Output_Block(in_planes=256, planes=256, block_type='A'))
        return nn.Sequential(*layers)

    def forward(self, x):
        # Shared stem convolution and pooling
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        # The stages
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.layer5(x)
        # Final output layers
        x = self.avgpool(x)
        x = self.conv_end(x)
        x = self.bn_end(x)
        x = torch.sigmoid(x)  # squash to 0-1 (F.sigmoid is deprecated)
        # Rearrange the output into the expected shape
        x = x.permute(0, 2, 3, 1)  # (-1,7,7,30)
        return x
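
The spatial size of the output can be traced through the network by hand. The sketch below assumes a 448*448 input (the image size used by the data loader in this article) and lists only the layers that change the spatial size. The helper names out_size and trace_spatial_size are made up for the example and are not part of the project.

```python
def out_size(size, kernel, stride, padding):
    """Output size of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) of every layer that changes the spatial size
LAYERS = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer2", 3, 2, 1),   # the stride-2 3*3 conv in the first block of the stage
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
    ("avgpool", 2, 2, 0),
]

def trace_spatial_size(size):
    """Return the spatial size after each size-changing layer."""
    sizes = {}
    for name, kernel, stride, padding in LAYERS:
        size = out_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes

sizes = trace_spatial_size(448)
print(sizes)
```

Under these assumptions layer4 produces a 14*14 map, layer5 keeps that size because its stride is 1, and the final 2*2 average pooling brings it down to the 7*7 grid that YOLOv1 expects.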


This part is simple: define one function per ResNet variant. The code is as follows:

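# 6. Factory functions for the different ResNets
# Download links for the pretrained weights
model_urls = {
    'resnet18': '',
    'resnet34': '',
    'resnet50': '',
    'resnet101': '',
    'resnet152': '',
}
# Build ResNet18
def resnet18(pretrained=False, **kwargs):
    model = ResNet(Base_Block, [2, 2, 2, 2], **kwargs)
    # Optionally load pretrained weights
    if pretrained:
        # Assumed body: fill in model_urls first; strict=False skips the layers
        # (layer5, conv_end, bn_end) that have no pretrained weights
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18']), strict=False)
    return model
# Build ResNet34
def resnet34(pretrained=False, **kwargs):
    model = ResNet(Base_Block, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34']), strict=False)
    return model
# Build ResNet50
def resnet50(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50']), strict=False)
    return model
# Build ResNet101
def resnet101(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 4, 23, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet101']), strict=False)
    return model
# Build ResNet152
def resnet152(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 8, 36, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152']), strict=False)
    return model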

2.4 Complete code

The complete code is as follows:

# author: baiCai
# 1. Import the required packages
import torch
import math
from torch import nn
import torch.utils.model_zoo as model_zoo
import torch.nn.functional as F
# 2. Build the block without 1*1 convolutions
class Base_Block(nn.Module):
    # Expansion factor: how many times the output channels are widened
    expansion = 1
    def __init__(self,in_planes,out_planes,stride=1,downsample=None):
        """
        :param in_planes: number of input channels
        :param out_planes: number of output channels
        :param stride: stride, default 1
        :param downsample: module applied to the shortcut when downsampling is needed, else None
        """
        super(Base_Block, self).__init__()
        # Define the layers
        self.conv1 = nn.Conv2d(in_planes,out_planes,kernel_size=3,stride=stride,padding=1,bias=False)
        self.bn1 = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor instead of allocating a new one
        self.conv2 = nn.Conv2d(out_planes,out_planes,kernel_size=3,stride=1,padding=1,bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self,x):
        # Keep the input for the shortcut (residual) branch
        res = x
        # Main branch
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        # Downsample the shortcut if required
        if self.downsample is not None:
            res = self.downsample(res)
        # Add the shortcut
        out += res
        # Final activation
        out = self.relu(out)
        return out

# 3. Build the block with 1*1 convolutions
class Senior_Block(nn.Module):
    expansion = 4
    def __init__(self,in_planes,planes,stride=1,downsample=None):
        """
        :param in_planes: number of input channels
        :param planes: number of intermediate channels; the final output has planes * expansion channels
        :param stride: stride
        :param downsample: module applied to the shortcut when downsampling is needed, else None
        """
        super(Senior_Block, self).__init__()
        self.conv1 = nn.Conv2d(in_planes,planes,kernel_size=1,bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes,planes,kernel_size=3,stride=stride,padding=1,bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes*4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self,x):
        # Keep the input for the shortcut (residual) branch
        res = x
        # Main branch
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        # Downsample the shortcut if required
        if self.downsample is not None:
            res = self.downsample(x)
        # Add the shortcut
        out += res
        out = self.relu(out)
        return out

# 4. Build the output layer
class Output_Block(nn.Module):
    expansion = 1
    def __init__(self, in_planes, planes, stride=1, block_type='A'):
        """
        :param in_planes: number of input channels
        :param planes: number of intermediate channels
        :param stride: stride
        :param block_type: block type; 'A' means no projection on the shortcut, 'B' means one is required
        """
        super(Output_Block, self).__init__()
        # Define the convolutions
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=2, bias=False,dilation=2)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)
        # Decide whether the shortcut needs a projection; compared with the
        # usual check, the block type is considered as well
        self.downsample = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes or block_type=='B':
            self.downsample = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                # BatchNorm after the projection, matching the shortcut built in _make_layer
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        # Main branch
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        # Add the (possibly projected) shortcut
        out += self.downsample(x)
        out = F.relu(out)
        return out

# 5. Build ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers):
        """
        :param block: the basic block class
        :param layers: number of blocks in each stage, which determines the ResNet variant; e.g. ResNet50 passes [3, 4, 6, 3]
        """
        super(ResNet, self).__init__()
        # Initial channel count: 64
        self.inplanes = 64
        # Stem convolution and pooling shared by every variant
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Define the four stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # Don't forget the output_block defined above. Its input is the output of
        # layer4: 512 channels for Base_Block, 2048 for Senior_Block
        self.layer5 = self._make_out_layer(in_channels=512 * block.expansion)
        # Final layers: pool, then map to 30 channels so the output is 7*7*30
        # (the pooling layer is the one shown in Section 2.3)
        self.avgpool = nn.AvgPool2d(2)  # kernel_size = 2  , stride = 2
        self.conv_end = nn.Conv2d(256, 30, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn_end = nn.BatchNorm2d(30)
        # Parameter initialisation
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    # Build one stage from the block class and the number of blocks
    def _make_layer(self, block, planes, blocks, stride=1):
        """
        :param block: the block class
        :param planes: base channel count of the stage (the output has planes * block.expansion channels)
        :param blocks: how many identical blocks to stack
        :param stride: stride
        """
        # Initialise the shortcut projection
        downsample = None
        # A projection is needed when the stride is not 1 or when the
        # input and output channel counts differ
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                # The projection exists so that the shortcut and the main branch can be added
                nn.Conv2d(self.inplanes, planes * block.expansion,kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        # Start building
        layers = []
        # The first block needs special handling:
        # e.g. if it receives 512 channels but outputs 256, the shortcut must be projected,
        # whereas the second block receives the first block's 256 channels and also
        # outputs 256, so no projection is needed
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            # Repeat the remaining blocks
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)  # * unpacks the list into separate arguments

    # Build the output layer
    def _make_out_layer(self, in_channels):
        layers = []
        # Stack output blocks in the same way as a regular stage
        layers.append(Output_Block(in_planes=in_channels, planes=256, block_type='B'))
        layers.append(Output_Block(in_planes=256, planes=256, block_type='A'))
        layers.append(Output_Block(in_planes=256, planes=256, block_type='A'))
        return nn.Sequential(*layers)

    def forward(self, x):
        # Shared stem convolution and pooling
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        # The stages
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.layer5(x)
        # Final output layers
        x = self.avgpool(x)
        x = self.conv_end(x)
        x = self.bn_end(x)
        x = torch.sigmoid(x)  # squash to 0-1 (F.sigmoid is deprecated)
        # Rearrange the output into the expected shape
        x = x.permute(0, 2, 3, 1)  # (-1,7,7,30)
        return x

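# 6. Factory functions for the different ResNets
# Download links for the pretrained weights
model_urls = {
    'resnet18': '',
    'resnet34': '',
    'resnet50': '',
    'resnet101': '',
    'resnet152': '',
}
# Build ResNet18
def resnet18(pretrained=False, **kwargs):
    model = ResNet(Base_Block, [2, 2, 2, 2], **kwargs)
    # Optionally load pretrained weights
    if pretrained:
        # Assumed body: fill in model_urls first; strict=False skips the layers
        # (layer5, conv_end, bn_end) that have no pretrained weights
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18']), strict=False)
    return model
# Build ResNet34
def resnet34(pretrained=False, **kwargs):
    model = ResNet(Base_Block, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34']), strict=False)
    return model
# Build ResNet50
def resnet50(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50']), strict=False)
    return model
# Build ResNet101
def resnet101(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 4, 23, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet101']), strict=False)
    return model
# Build ResNet152
def resnet152(pretrained=False, **kwargs):
    model = ResNet(Senior_Block, [3, 8, 36, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152']), strict=False)
    return model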

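3. Dataset processing

We need to convert the dataset into txt files. Each line is stored as image_name.jpg x1 y1 x2 y2 c x1 y1 x2 y2 c (this example is an image with two objects), where (x1, y1) and (x2, y2) are the top-left and bottom-right corners of a box and c is the class index.

This file is named generate_txt_file.py and lives in the utils folder.

The implementation is fairly simple:

1. Open two files, one for train and one for test.
2. Read the xml files one by one, parse each of them, and write the extracted information to the txt file, one txt line per xml file.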
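
As a rough illustration of step 2, the sketch below parses a tiny hand-written annotation in the VOC layout and turns it into one txt line. The XML content, the file name demo.jpg, the three-class tuple DEMO_CLASSES and the helper xml_to_line are all invented for the example; the real script reads the files under Annotations and uses all 20 classes.

```python
from xml.etree import ElementTree as ET

DEMO_CLASSES = ('aeroplane', 'bicycle', 'bird')  # shortened class list for the demo

DEMO_XML = """
<annotation>
  <object>
    <name>bird</name><difficult>0</difficult>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>bicycle</name><difficult>1</difficult>
    <bndbox><xmin>5</xmin><ymin>5</ymin><xmax>50</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>
"""

def xml_to_line(image_name, xml_text):
    """Build 'image x1 y1 x2 y2 c ...' from an annotation, skipping difficult objects."""
    root = ET.fromstring(xml_text.strip())
    fields = [image_name]
    for obj in root.findall('object'):
        if int(obj.find('difficult').text) == 1:
            continue
        box = obj.find('bndbox')
        coords = [box.find(tag).text for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        fields += coords + [str(DEMO_CLASSES.index(obj.find('name').text))]
    return ' '.join(fields)

line = xml_to_line('demo.jpg', DEMO_XML)
print(line)
```

The second object is marked difficult, so only the bird ends up in the line.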

The code is as follows:

# author: baiCai
# 1. Imports
from xml.etree import ElementTree as ET
import os
import random

# 2. Basic parameters
# All class names
VOC_CLASSES = (
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')
# File names for the training and test sets
train_set = open('voctrain.txt', 'w')
test_set = open('voctest.txt', 'w')
# Path of the xml files to read; remember to change it to your own
Annotations = '../../data/VOC2012/Annotations/'
# List all xml files
xml_files = os.listdir(Annotations)
# Shuffle the dataset
random.shuffle(xml_files)
# Number of training samples
train_num = int(len(xml_files) * 0.7)
# Training list
train_lists = xml_files[:train_num]
# Test list
test_lists = xml_files[train_num:]

# 3. Function that parses an xml file
def parse_rec(filename):
    # Argument: path of the xml file
    # Build the xml tree
    tree = ET.parse(filename)
    objects = []
    # Iterate over the object nodes, i.e. the annotated objects
    for obj in tree.findall('object'):
        obj_struct = {}
        # The difficult flag: hard-to-recognise objects are not used here
        difficult = int(obj.find('difficult').text)
        if difficult == 1:  # skip this object
            continue
        # Collect the information
        obj_struct['name'] = obj.find('name').text
        bbox = obj.find('bndbox')
        obj_struct['bbox'] = [int(float(bbox.find('xmin').text)),
                              int(float(bbox.find('ymin').text)),
                              int(float(bbox.find('xmax').text)),
                              int(float(bbox.find('ymax').text))]
        objects.append(obj_struct)

    return objects

# 4. Save the information to the files
def write_txt():
    # Generate the training txt
    count = 0
    for train_list in train_lists:
        count += 1
        # Image file name
        image_name = train_list.split('.')[0] + '.jpg'
        # Parse the annotation
        results = parse_rec(Annotations + train_list)
        # An empty result means every object in the image is marked difficult, so skip it
        if len(results) == 0:
            continue
        # Otherwise write to the file
        # First the image name
        train_set.write(image_name)
        # Then every object in the agreed format
        for result in results:
            class_name = result['name']
            bbox = result['bbox']
            class_name = VOC_CLASSES.index(class_name)
            train_set.write(' ' + str(bbox[0]) +
                            ' ' + str(bbox[1]) +
                            ' ' + str(bbox[2]) +
                            ' ' + str(bbox[3]) +
                            ' ' + str(class_name))
        # One image per line
        train_set.write('\n')
    train_set.close()
    # Generate the test txt
    # Same logic as above
    for test_list in test_lists:
        count += 1
        image_name = test_list.split('.')[0] + '.jpg'  # image file name
        results = parse_rec(Annotations + test_list)
        if len(results) == 0:
            continue
        test_set.write(image_name)
        for result in results:
            class_name = result['name']
            bbox = result['bbox']
            class_name = VOC_CLASSES.index(class_name)
            test_set.write(' ' + str(bbox[0]) +
                            ' ' + str(bbox[1]) +
                            ' ' + str(bbox[2]) +
                            ' ' + str(bbox[3]) +
                            ' ' + str(class_name))
        test_set.write('\n')
    test_set.close()

# 5. Run
if __name__ == '__main__':
    write_txt()
You need to run this script once yourself. The result is as follows:


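4. Data loader

Next we build the data loader. Object-detection data loaders usually come in two flavours. One reads the xml (annotation) files directly. The other, more common, first saves the contents of the xml files to a txt file, with each line stored as image_name.jpg x1 y1 x2 y2 c x1 y1 x2 y2 c (an image with two objects). We use the second approach here.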
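
Reading such a line back is straightforward: after the file name, every five fields describe one object. A minimal sketch of that idea (the sample line and the helper name parse_line are made up for illustration):

```python
def parse_line(line):
    """Split one txt line into (file name, boxes, class indices)."""
    fields = line.strip().split()
    fname = fields[0]
    boxes, labels = [], []
    num_boxes = (len(fields) - 1) // 5
    for i in range(num_boxes):
        # Four corner coordinates followed by the class index
        x1, y1, x2, y2 = (float(v) for v in fields[1 + 5 * i: 5 + 5 * i])
        boxes.append([x1, y1, x2, y2])
        labels.append(int(fields[5 + 5 * i]))
    return fname, boxes, labels

fname, boxes, labels = parse_line('demo.jpg 10 20 110 220 2 5 5 50 50 1\n')
print(fname, boxes, labels)
```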

4.1 Imports

# 1. Imports
import os
import random
import numpy as np
import torch
import torchvision.transforms as T
from torch.utils.data import Dataset
import cv2

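4.2 Building the data loader


As usual, first write the overall Dataset skeleton to organise our thoughts:

# 2. Build the data loader
class Yolo_Dataset(Dataset):
    # Default image size
    image_size = 448
    def __init__(self):
        pass
    def __len__(self):
        pass
    def __getitem__(self, idx):
        pass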


The main job of __init__ is to read the contents of the txt file and define some basic parameters. The code is as follows:

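def __init__(self,root,list_file,train=True,transforms=None):
    """
    :param root: root directory, e.g. `./data/`; it must contain the actual images
    :param list_file: the txt file generated earlier with generate_txt_file; here it sits in the same directory as this file
    :param train: whether this is the training set, default True
    :param transforms: preprocessing transforms, default None
    """
    # Initialise the parameters
    self.root = root
    self.train = train
    self.transform = transforms
    self.fnames = []    # file names
    self.boxes = []     # box tensors, one per image
    self.labels = []    # label tensors, one per image
    self.mean = (123, 117, 104)  # RGB mean
    # Open the file and read its contents
    with open(list_file) as f:
        lines  = f.readlines()
    # Process each line
    for line in lines:
        splited = line.strip().split()
        self.fnames.append(splited[0])   # file name
        # Read the class and box information
        # Every five values describe one object, hence the arithmetic with 5 below
        num_boxes = (len(splited) - 1) // 5
        box = []
        label = []
        for i in range(num_boxes):
            x = float(splited[1+5*i])
            y = float(splited[2+5*i])
            x2 = float(splited[3+5*i])
            y2 = float(splited[4+5*i])
            c = splited[5+5*i]
            box.append([x, y, x2, y2])
            # Store labels as 1..20 so that labels[i]+9 in encoder lands in channels 10..29
            label.append(int(c) + 1)
        self.boxes.append(torch.Tensor(box))
        self.labels.append(torch.LongTensor(label))
    # Number of images
    self.num_samples = len(self.boxes)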


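This method is simple: it defines the value returned by len(obj). Here we return the number of images (one entry per line of the txt file), so the code is: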

def __len__(self):
    # Return the number of images
    return self.num_samples


The main thing to note about __getitem__ is that the usual outputs must be converted into the form YOLOv1 needs, i.e. 7*7*30.

The code is as follows:

def __getitem__(self, idx):
    # Get the image at the given index
    fname = self.fnames[idx]
    # Open the image; remember to build the full path
    img = cv2.imread(os.path.join(self.root, fname))
    # Get the boxes and labels of this image
    boxes = self.boxes[idx].clone()
    labels = self.labels[idx].clone()
    # In training mode, apply image augmentation
    # Note that the image and the boxes are processed together
    if self.train:
        img, boxes = self.random_flip(img, boxes)   # random flip
        img, boxes = self.randomScale(img, boxes)   # random scaling
        img = self.randomBlur(img)      # random blur
        img = self.RandomBrightness(img)    # random brightness
        img = self.RandomHue(img)   # random hue
        img = self.RandomSaturation(img) # random saturation
        img, boxes, labels = self.randomShift(img, boxes, labels)   # random shift
        img, boxes, labels = self.randomCrop(img, boxes, labels)    # random crop
    # Basic processing
    h, w, _ = img.shape
    # Normalise the boxes to [0, 1]
    boxes /= torch.Tensor([w, h, w, h]).expand_as(boxes)
    # cv2 reads images as BGR by default, so convert to RGB
    img = self.BGR2RGB(img)
    # Subtract the mean
    img = self.subMean(img, self.mean)
    # Resize the image to the target size, i.e. 448*448
    img = cv2.resize(img, (self.image_size, self.image_size))
    # Convert the boxes and labels into the 7*7*30 target that YOLOv1 needs
    target = self.encoder(boxes, labels)  # 7x7x30
    # Finally, apply the preprocessing transforms
    for t in self.transform:
        img = t(img)

    return img, target


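def encoder(self,boxes,labels):
    """
    boxes (tensor) [[x1,y1,x2,y2],[]], normalised to [0, 1]
    labels (tensor) [...]
    return 7x7x30
    """
    grid_num = 7
    # Start from an all-zero tensor and fill it in
    target = torch.zeros((grid_num,grid_num,30))
    # Size of one grid cell in normalised coordinates
    cell_size = 1./grid_num
    # Compute w, h and the centre coordinates
    wh = boxes[:,2:]-boxes[:,:2]
    cxcy = (boxes[:,2:]+boxes[:,:2])/2
    for i in range(cxcy.size()[0]):
        cxcy_sample = cxcy[i]       # centre of this box
        # Index of the grid cell containing the centre; multiplied by cell_size
        # it gives the normalised top-left corner of that cell
        ij = (cxcy_sample/cell_size).ceil()-1
        # Confidence of both predicted boxes
        target[int(ij[1]),int(ij[0]),4] = 1
        target[int(ij[1]),int(ij[0]),9] = 1
        # One-hot class entry (labels are 1..20, so channels 10..29)
        target[int(ij[1]),int(ij[0]),int(labels[i])+9] = 1
        # Normalised top-left corner of the matched cell
        xy = ij*cell_size
        # Offset of the centre relative to the cell, in units of one cell
        delta_xy = (cxcy_sample -xy)/cell_size
        target[int(ij[1]),int(ij[0]),2:4] = wh[i]
        target[int(ij[1]),int(ij[0]),:2] = delta_xy
        target[int(ij[1]),int(ij[0]),7:9] = wh[i]
        target[int(ij[1]),int(ij[0]),5:7] = delta_xy
    return target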
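
To see what encoder does for a single box, the sketch below repeats the same arithmetic in plain Python for one normalised box. The box values and the helper name encode_one are invented for the example; the grid size of 7 and the ceil() - 1 cell rule follow the code above.

```python
import math

GRID_NUM = 7

def encode_one(box):
    """Return (row, col, dx, dy, w, h) for a normalised [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    cell_size = 1.0 / GRID_NUM
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Index of the cell that contains the centre
    col = math.ceil(cx / cell_size) - 1
    row = math.ceil(cy / cell_size) - 1
    # Offset of the centre inside that cell, in units of one cell
    dx = (cx - col * cell_size) / cell_size
    dy = (cy - row * cell_size) / cell_size
    return row, col, dx, dy, w, h

row, col, dx, dy, w, h = encode_one([0.4, 0.1, 0.6, 0.5])
print(row, col, round(dx, 3), round(dy, 3), round(w, 3), round(h, 3))
```

In the target tensor this box would be written at target[row, col], with the offsets and the size stored twice (once per predicted box) and both confidence slots set to 1.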

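Note that the method above packs the image's box and label information into a 7*7*30 tensor. Going by the code, each cell holds: channels 0-1 the centre offset, 2-3 the width and height, 4 the confidence of the first box; channels 5-9 the same five values for the second box; and channels 10-29 the one-hot class. The layout is:


The image-augmentation code used above is copied directly from someone else's implementation. It differs from the official implementation in that it also takes the box information into account, i.e. it processes the image and the boxes together: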

# The preprocessing and augmentation methods below are copied directly from someone else's code
def BGR2RGB(self, img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

def BGR2HSV(self, img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

def HSV2BGR(self, img):
    return cv2.cvtColor(img, cv2.COLOR_HSV2BGR)

def RandomBrightness(self, bgr):
    if random.random() < 0.5:
        hsv = self.BGR2HSV(bgr)
        h, s, v = cv2.split(hsv)
        adjust = random.choice([0.5, 1.5])
        v = v * adjust
        v = np.clip(v, 0, 255).astype(hsv.dtype)
        hsv = cv2.merge((h, s, v))
        bgr = self.HSV2BGR(hsv)
    return bgr

def RandomSaturation(self, bgr):
    if random.random() < 0.5:
        hsv = self.BGR2HSV(bgr)
        h, s, v = cv2.split(hsv)
        adjust = random.choice([0.5, 1.5])
        s = s * adjust
        s = np.clip(s, 0, 255).astype(hsv.dtype)
        hsv = cv2.merge((h, s, v))
        bgr = self.HSV2BGR(hsv)
    return bgr

def RandomHue(self, bgr):
    if random.random() < 0.5:
        hsv = self.BGR2HSV(bgr)
        h, s, v = cv2.split(hsv)
        adjust = random.choice([0.5, 1.5])
        h = h * adjust
        h = np.clip(h, 0, 255).astype(hsv.dtype)
        hsv = cv2.merge((h, s, v))
        bgr = self.HSV2BGR(hsv)
    return bgr

def randomBlur(self, bgr):
    if random.random() < 0.5:
        bgr = cv2.blur(bgr, (5, 5))
    return bgr

def randomShift(self, bgr, boxes, labels):
    # Translation (shift) transform
    center = (boxes[:, 2:] + boxes[:, :2]) / 2
    if random.random() < 0.5:
        height, width, c = bgr.shape
        after_shfit_image = np.zeros((height, width, c), dtype=bgr.dtype)
        after_shfit_image[:, :, :] = (104, 117, 123)  # bgr
        shift_x = random.uniform(-width * 0.2, width * 0.2)
        shift_y = random.uniform(-height * 0.2, height * 0.2)
        # print(bgr.shape,shift_x,shift_y)
        # Shift the original image
        if shift_x >= 0 and shift_y >= 0:
            after_shfit_image[int(shift_y):, int(shift_x):, :] = bgr[:height - int(shift_y), :width - int(shift_x), :]
        elif shift_x >= 0 and shift_y < 0:
            after_shfit_image[:height + int(shift_y), int(shift_x):, :] = bgr[-int(shift_y):, :width - int(shift_x), :]
        elif shift_x < 0 and shift_y >= 0:
            after_shfit_image[int(shift_y):, :width + int(shift_x), :] = bgr[:height - int(shift_y), -int(shift_x):, :]
        elif shift_x < 0 and shift_y < 0:
            after_shfit_image[:height + int(shift_y), :width + int(shift_x), :] = bgr[-int(shift_y):,
                                                                                  -int(shift_x):, :]

        # Shift the box centres and keep only boxes whose centre stays inside the image
        shift_xy = torch.FloatTensor([[int(shift_x), int(shift_y)]]).expand_as(center)
        center = center + shift_xy
        mask1 = (center[:, 0] > 0) & (center[:, 0] < width)
        mask2 = (center[:, 1] > 0) & (center[:, 1] < height)
        mask = (mask1 & mask2).view(-1, 1)
        boxes_in = boxes[mask.expand_as(boxes)].view(-1, 4)
        if len(boxes_in) == 0:
            return bgr, boxes, labels
        box_shift = torch.FloatTensor([[int(shift_x), int(shift_y), int(shift_x), int(shift_y)]]).expand_as(
            boxes_in)
        boxes_in = boxes_in + box_shift
        labels_in = labels[mask.view(-1)]
        return after_shfit_image, boxes_in, labels_in
    return bgr, boxes, labels

def randomScale(self, bgr, boxes):
    # Keep the height fixed and stretch the width by a factor of 0.8-1.2 to deform the image
    if random.random() < 0.5:
        scale = random.uniform(0.8, 1.2)
        height, width, c = bgr.shape
        bgr = cv2.resize(bgr, (int(width * scale), height))
        scale_tensor = torch.FloatTensor([[scale, 1, scale, 1]]).expand_as(boxes)
        boxes = boxes * scale_tensor
        return bgr, boxes
    return bgr, boxes

def randomCrop(self, bgr, boxes, labels):
    if random.random() < 0.5:
        center = (boxes[:, 2:] + boxes[:, :2]) / 2
        height, width, c = bgr.shape
        h = random.uniform(0.6 * height, height)
        w = random.uniform(0.6 * width, width)
        x = random.uniform(0, width - w)
        y = random.uniform(0, height - h)
        x, y, h, w = int(x), int(y), int(h), int(w)

        center = center - torch.FloatTensor([[x, y]]).expand_as(center)
        mask1 = (center[:, 0] > 0) & (center[:, 0] < w)
        mask2 = (center[:, 1] > 0) & (center[:, 1] < h)
        mask = (mask1 & mask2).view(-1, 1)

        boxes_in = boxes[mask.expand_as(boxes)].view(-1, 4)
        if (len(boxes_in) == 0):
            return bgr, boxes, labels
        box_shift = torch.FloatTensor([[x, y, x, y]]).expand_as(boxes_in)

        boxes_in = boxes_in - box_shift
        boxes_in[:, 0] = boxes_in[:, 0].clamp_(min=0, max=w)
        boxes_in[:, 2] = boxes_in[:, 2].clamp_(min=0, max=w)
        boxes_in[:, 1] = boxes_in[:, 1].clamp_(min=0, max=h)
        boxes_in[:, 3] = boxes_in[:, 3].clamp_(min=0, max=h)

        labels_in = labels[mask.view(-1)]
        img_croped = bgr[y:y + h, x:x + w, :]
        return img_croped, boxes_in, labels_in
    return bgr, boxes, labels

def subMean(self, bgr, mean):
    mean = np.array(mean, dtype=np.float32)
    bgr = bgr - mean
    return bgr

def random_flip(self, im, boxes):
    if random.random() < 0.5:
        im_lr = np.fliplr(im).copy()
        h, w, _ = im.shape
        xmin = w - boxes[:, 2]
        xmax = w - boxes[:, 0]
        boxes[:, 0] = xmin
        boxes[:, 2] = xmax
        return im_lr, boxes
    return im, boxes

def random_bright(self, im, delta=16):
    alpha = random.random()
    if alpha > 0.3:
        im = im * alpha + random.randrange(-delta, delta)
        im = im.clip(min=0, max=255).astype(np.uint8)
    return im
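
The reason these augmentations take the boxes as an argument is that any geometric change to the image moves the objects too. Here is a minimal pure-Python sketch of the box part of a horizontal flip, mirroring what random_flip does with tensors (the helper name flip_boxes and the sample values are made up for illustration):

```python
def flip_boxes(boxes, width):
    """Mirror [x1, y1, x2, y2] boxes horizontally inside an image of the given width."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        # The old right edge becomes the new left edge, and vice versa
        flipped.append([width - x2, y1, width - x1, y2])
    return flipped

demo_boxes = [[10, 20, 110, 220]]
flipped_boxes = flip_boxes(demo_boxes, width=500)
print(flipped_boxes)
```

Flipping twice returns the original boxes, which is a handy sanity check for this kind of transform.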

4.3 Debugging code

A short piece of debugging code checks whether the data loader works:

# 3. Debugging code
def main():
    from torch.utils.data import DataLoader
    import torchvision.transforms as transforms
    file_root = '../../data/VOC2012/JPEGImages/' # remember to change this to your own path
    train_dataset = Yolo_Dataset(root=file_root,list_file='voctrain.txt',train=True,transforms = [T.ToTensor()] )
    train_loader = DataLoader(train_dataset,batch_size=1,shuffle=False,num_workers=0)
    train_iter = iter(train_loader)
    for i in range(100):
        img,target = next(train_iter)

The result of running it is shown below:


4.4 Complete code

The complete code of this file is as follows:

# author: baiCai
# 1. 导包
import os
import random
import numpy as np
import torch
import torchvision.transforms as T
from import Dataset
import cv2

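# 2. Build the data loader
class Yolo_Dataset(Dataset):
    # default image size
    image_size = 448
    def __init__(self,root,list_file,train=True,transforms=None):
        '''
        :param root: root directory, e.g. `./data/`; it must contain the actual images
        :param list_file: the txt file generated earlier with generate_txt_file; here it sits in the same directory as this file
        :param train: whether this is the training set, default True
        :param transforms: preprocessing transforms, default None
        '''
        # initialize the attributes
        self.root = root
        self.train = train
        self.transform = transforms
        self.fnames = []
        self.boxes = []
        self.labels = []
        self.mean = (123, 117, 104)  # RGB
        # open the list file and read its contents
        with open(list_file) as f:
            lines  = f.readlines()
        # process each line
        for line in lines:
            splited = line.strip().split()
            self.fnames.append(splited[0])   # file name
            # read the class and box information
            # every five values describe one object, hence the arithmetic with 5 below
            num_boxes = (len(splited) - 1) // 5
            box = []
            label = []
            for i in range(num_boxes):
                x = float(splited[1+5*i])
                y = float(splited[2+5*i])
                x2 = float(splited[3+5*i])
                y2 = float(splited[4+5*i])
                c = splited[5+5*i]
                box.append([x, y, x2, y2])
                # assumption: class ids in the list file are 0-based; the +1 makes
                # encoder()'s `int(labels[i]) + 9` land on channels 10-29
                label.append(int(c) + 1)
            # store per-image tensors, because __getitem__ calls .clone() on them
            self.boxes.append(torch.Tensor(box))
            self.labels.append(torch.LongTensor(label))
        # record the number of samples (images)
        self.num_samples = len(self.boxes)

    def __len__(self):
        # return the number of samples
        return self.num_samples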

    def __getitem__(self, idx):
        # file name of the requested image
        fname = self.fnames[idx]
        # read the image; remember to join the full path
        img = cv2.imread(os.path.join(self.root, fname))
        # boxes and labels that belong to this image
        boxes = self.boxes[idx].clone()
        labels = self.labels[idx].clone()
        # in training mode, apply data augmentation
        # note that the image and the boxes must be transformed together
        if self.train:
            img, boxes = self.random_flip(img, boxes)   # random flip
            img, boxes = self.randomScale(img, boxes)   # random scaling
            img = self.randomBlur(img)      # random blur
            img = self.RandomBrightness(img)    # random brightness
            img = self.RandomHue(img)   # random hue
            img = self.RandomSaturation(img) # random saturation
            img, boxes, labels = self.randomShift(img, boxes, labels)   # random shift
            img, boxes, labels = self.randomCrop(img, boxes, labels)    # random crop
        # basic processing
        h, w, _ = img.shape
        # normalize the box coordinates to [0, 1]
        boxes /= torch.Tensor([w, h, w, h]).expand_as(boxes)
        # cv2 reads images as BGR by default, so convert to RGB
        img = self.BGR2RGB(img)
        # subtract the mean
        img = self.subMean(img, self.mean)
        # resize the image to the target size, i.e. 448*448
        img = cv2.resize(img, (self.image_size, self.image_size))
        # encode boxes and labels into the 7*7*30 target that YOLOv1 needs
        target = self.encoder(boxes, labels)  # 7x7x30
        # finally, apply the preprocessing transforms
        for t in self.transform:
            img = t(img)

        return img, target

    def encoder(self,boxes,labels):
        '''
        boxes (tensor) [[x1,y1,x2,y2],[]]
        labels (tensor) [...]
        return 7x7x30
        '''
        grid_num = 7
        # start from an all-zero tensor and fill it in below
        target = torch.zeros((grid_num,grid_num,30))
        # size of one grid cell in normalized coordinates
        cell_size = 1./grid_num
        # width/height and center of every box
        wh = boxes[:,2:]-boxes[:,:2]
        cxcy = (boxes[:,2:]+boxes[:,:2])/2
        for i in range(cxcy.size()[0]):
            cxcy_sample = cxcy[i]       # center of this box
            ij = (cxcy_sample/cell_size).ceil()-1 # index of the cell containing the center; times cell_size gives its normalized top-left corner
            target[int(ij[1]),int(ij[0]),4] = 1
            target[int(ij[1]),int(ij[0]),9] = 1
            target[int(ij[1]),int(ij[0]),int(labels[i])+9] = 1
            # normalized top-left corner of the matched cell
            xy = ij*cell_size
            # offset of the center relative to that corner, in units of one cell
            delta_xy = (cxcy_sample -xy)/cell_size
            target[int(ij[1]),int(ij[0]),2:4] = wh[i]
            target[int(ij[1]),int(ij[0]),:2] = delta_xy
            target[int(ij[1]),int(ij[0]),7:9] = wh[i]
            target[int(ij[1]),int(ij[0]),5:7] = delta_xy
        return target

    # The augmentation helpers below are copied directly from other people's code
    def BGR2RGB(self, img):
        return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    def BGR2HSV(self, img):
        return cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    def HSV2BGR(self, img):
        return cv2.cvtColor(img, cv2.COLOR_HSV2BGR)

    def RandomBrightness(self, bgr):
        if random.random() < 0.5:
            hsv = self.BGR2HSV(bgr)
            h, s, v = cv2.split(hsv)
            adjust = random.choice([0.5, 1.5])
            v = v * adjust
            v = np.clip(v, 0, 255).astype(hsv.dtype)
            hsv = cv2.merge((h, s, v))
            bgr = self.HSV2BGR(hsv)
        return bgr

    def RandomSaturation(self, bgr):
        if random.random() < 0.5:
            hsv = self.BGR2HSV(bgr)
            h, s, v = cv2.split(hsv)
            adjust = random.choice([0.5, 1.5])
            s = s * adjust
            s = np.clip(s, 0, 255).astype(hsv.dtype)
            hsv = cv2.merge((h, s, v))
            bgr = self.HSV2BGR(hsv)
        return bgr

    def RandomHue(self, bgr):
        if random.random() < 0.5:
            hsv = self.BGR2HSV(bgr)
            h, s, v = cv2.split(hsv)
            adjust = random.choice([0.5, 1.5])
            h = h * adjust
            h = np.clip(h, 0, 255).astype(hsv.dtype)
            hsv = cv2.merge((h, s, v))
            bgr = self.HSV2BGR(hsv)
        return bgr

    def randomBlur(self, bgr):
        if random.random() < 0.5:
            bgr = cv2.blur(bgr, (5, 5))
        return bgr

    def randomShift(self, bgr, boxes, labels):
        # random translation (shift)
        center = (boxes[:, 2:] + boxes[:, :2]) / 2
        if random.random() < 0.5:
            height, width, c = bgr.shape
            after_shfit_image = np.zeros((height, width, c), dtype=bgr.dtype)
            after_shfit_image[:, :, :] = (104, 117, 123)  # bgr
            shift_x = random.uniform(-width * 0.2, width * 0.2)
            shift_y = random.uniform(-height * 0.2, height * 0.2)
            # print(bgr.shape,shift_x,shift_y)
            # shift the original image
            if shift_x >= 0 and shift_y >= 0:
                after_shfit_image[int(shift_y):, int(shift_x):, :] = bgr[:height - int(shift_y), :width - int(shift_x), :]
            elif shift_x >= 0 and shift_y < 0:
                after_shfit_image[:height + int(shift_y), int(shift_x):, :] = bgr[-int(shift_y):, :width - int(shift_x), :]
            elif shift_x < 0 and shift_y >= 0:
                after_shfit_image[int(shift_y):, :width + int(shift_x), :] = bgr[:height - int(shift_y), -int(shift_x):, :]
            elif shift_x < 0 and shift_y < 0:
                after_shfit_image[:height + int(shift_y), :width + int(shift_x), :] = bgr[-int(shift_y):,
                                                                                      -int(shift_x):, :]

            shift_xy = torch.FloatTensor([[int(shift_x), int(shift_y)]]).expand_as(center)
            center = center + shift_xy
            # keep only the boxes whose center is still inside the image
            mask1 = (center[:, 0] > 0) & (center[:, 0] < width)
            mask2 = (center[:, 1] > 0) & (center[:, 1] < height)
            mask = (mask1 & mask2).view(-1, 1)
            boxes_in = boxes[mask.expand_as(boxes)].view(-1, 4)
            if len(boxes_in) == 0:
                return bgr, boxes, labels
            box_shift = torch.FloatTensor([[int(shift_x), int(shift_y), int(shift_x), int(shift_y)]]).expand_as(
                boxes_in)
            boxes_in = boxes_in + box_shift
            labels_in = labels[mask.view(-1)]
            return after_shfit_image, boxes_in, labels_in
        return bgr, boxes, labels

    def randomScale(self, bgr, boxes):
        # keep the height fixed and stretch the width by a factor of 0.8-1.2 to deform the image
        if random.random() < 0.5:
            scale = random.uniform(0.8, 1.2)
            height, width, c = bgr.shape
            bgr = cv2.resize(bgr, (int(width * scale), height))
            scale_tensor = torch.FloatTensor([[scale, 1, scale, 1]]).expand_as(boxes)
            boxes = boxes * scale_tensor
            return bgr, boxes
        return bgr, boxes

    def randomCrop(self, bgr, boxes, labels):
        if random.random() < 0.5:
            center = (boxes[:, 2:] + boxes[:, :2]) / 2
            height, width, c = bgr.shape
            h = random.uniform(0.6 * height, height)
            w = random.uniform(0.6 * width, width)
            x = random.uniform(0, width - w)
            y = random.uniform(0, height - h)
            x, y, h, w = int(x), int(y), int(h), int(w)

            center = center - torch.FloatTensor([[x, y]]).expand_as(center)
            mask1 = (center[:, 0] > 0) & (center[:, 0] < w)
            mask2 = (center[:, 1] > 0) & (center[:, 1] < h)
            mask = (mask1 & mask2).view(-1, 1)

            boxes_in = boxes[mask.expand_as(boxes)].view(-1, 4)
            if (len(boxes_in) == 0):
                return bgr, boxes, labels
            box_shift = torch.FloatTensor([[x, y, x, y]]).expand_as(boxes_in)

            boxes_in = boxes_in - box_shift
            boxes_in[:, 0] = boxes_in[:, 0].clamp_(min=0, max=w)
            boxes_in[:, 2] = boxes_in[:, 2].clamp_(min=0, max=w)
            boxes_in[:, 1] = boxes_in[:, 1].clamp_(min=0, max=h)
            boxes_in[:, 3] = boxes_in[:, 3].clamp_(min=0, max=h)

            labels_in = labels[mask.view(-1)]
            img_croped = bgr[y:y + h, x:x + w, :]
            return img_croped, boxes_in, labels_in
        return bgr, boxes, labels

    def subMean(self, bgr, mean):
        mean = np.array(mean, dtype=np.float32)
        bgr = bgr - mean
        return bgr

    def random_flip(self, im, boxes):
        if random.random() < 0.5:
            im_lr = np.fliplr(im).copy()
            h, w, _ = im.shape
            xmin = w - boxes[:, 2]
            xmax = w - boxes[:, 0]
            boxes[:, 0] = xmin
            boxes[:, 2] = xmax
            return im_lr, boxes
        return im, boxes

    def random_bright(self, im, delta=16):
        alpha = random.random()
        if alpha > 0.3:
            im = im * alpha + random.randrange(-delta, delta)
            im = im.clip(min=0, max=255).astype(np.uint8)
        return im

# 3. Debugging code
def main():
    from torch.utils.data import DataLoader
    import torchvision.transforms as transforms
    file_root = '../../data/VOC2012/JPEGImages/'  # remember to change this to your own path
    train_dataset = Yolo_Dataset(root=file_root,list_file='voctrain.txt',train=True,transforms = [transforms.ToTensor()] )
    train_loader = DataLoader(train_dataset,batch_size=1,shuffle=False,num_workers=0)
    train_iter = iter(train_loader)
    for i in range(100):
        img,target = next(train_iter)

if __name__ == '__main__':
    main()
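
To make the grid assignment in `encoder` concrete, here is a small framework-free sketch that reproduces the same arithmetic for a single normalized box. The helper name `encode_box` and the sample box are illustrative assumptions, not part of the original file.

```python
import math

def encode_box(x1, y1, x2, y2, grid_num=7):
    """Mirror the arithmetic of encoder() for one box with normalized [0, 1] corners.

    Hypothetical helper for illustration; returns (row, col, dx, dy, w, h).
    """
    cell_size = 1.0 / grid_num
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # box center
    w, h = x2 - x1, y2 - y1                  # box width and height
    # index of the grid cell that contains the center (same ceil() - 1 rule as encoder)
    col = math.ceil(cx / cell_size) - 1
    row = math.ceil(cy / cell_size) - 1
    # offset of the center from the cell's top-left corner, in units of one cell
    dx = (cx - col * cell_size) / cell_size
    dy = (cy - row * cell_size) / cell_size
    return row, col, dx, dy, w, h

row, col, dx, dy, w, h = encode_box(0.2, 0.3, 0.6, 0.7)
print(row, col, round(dx, 3), round(dy, 3), round(w, 3), round(h, 3))
```

For this box the center (0.4, 0.5) falls into row 3, column 2, with offsets of roughly (0.8, 0.5). In the real target tensor these values are written to channels 0-3 and 5-8 of that cell, and the confidence channels 4 and 9 are set to 1.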

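5. Loss function


This file is named Yolo_Loss.py and lives in the utils folder. The YOLOv1 loss function is given by the formula below. (The labels A-E are referenced in the code comments, so that it is clear which part of the loss each piece of code computes.)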
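
For reference, the loss from the YOLOv1 paper can be written as below. The A-E labels are an assumption inferred from the code comments in this section (A and B: coordinates, C: confidence of boxes in cells with an object, D: confidence in cells without an object, E: classification); the original figure is not reproduced here.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] && \text{(A)} \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] && \text{(B)} \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 && \text{(C)} \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 && \text{(D)} \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 && \text{(E)}
\end{aligned}
```

Here $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, matching the defaults `l_coord` and `l_noobj` in the code. Note that the `forward` method in this section also adds a confidence term for the non-responsible box of each object cell (`not_contain_loss`) with weight 1.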


5.1 Imports

# 1. Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

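5.2 Building the loss function


The first step is to set up the skeleton:

# 2. Loss function class
class Yolo_Loss(nn.Module):
    def __init__(self):
        super(Yolo_Loss, self).__init__()
    # forward pass
    def forward(self):
        pass
    # IoU helper
    def compute_iou(self):
        pass

The loss involves IoU, so a dedicated method for computing it is needed.


The `__init__` method is simple: it only has to store the hyperparameters that the loss function uses: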

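def __init__(self,S=7, B=2, l_coord=5, l_noobj=0.5):
    '''
    :param S: S from the YOLOv1 paper, the grid size, default 7
    :param B: B from the YOLOv1 paper, the number of boxes predicted per cell, default 2
    :param l_coord: loss hyperparameter (lambda_coord), default 5
    :param l_noobj: loss hyperparameter (lambda_noobj), default 0.5
    '''
    super(Yolo_Loss, self).__init__()
    # store the hyperparameters
    self.S = S
    self.B = B
    self.l_coord = l_coord
    self.l_noobj = l_noobj


The `forward` method is built step by step from the loss formula; see the comments below: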

# forward pass
def forward(self,pred_tensor,target_tensor):
    # batch size
    N = pred_tensor.size()[0]
    # mask of the cells that contain an object, shape [batch,7,7]
    coo_mask = target_tensor[:, :, :, 4] > 0
    # mask of the cells that contain no object, shape [batch,7,7]
    noo_mask = target_tensor[:, :, :, 4] == 0
    # expand both masks to shape [batch,7,7,30]
    coo_mask = coo_mask.unsqueeze(-1).expand_as(target_tensor)
    noo_mask = noo_mask.unsqueeze(-1).expand_as(target_tensor)
    # predictions of the cells (7*7 in total) that contain an object, reshaped to [x,30], where x is the number of such cells
    coo_pred = pred_tensor[coo_mask].view(-1, 30)
    # split the values obtained above
    # 1. boxes [x1,y1,w1,h1,c1], shape [2x,5], because every cell predicts two boxes
    box_pred = coo_pred[:, :10].contiguous().view(-1, 5)
    # 2. class scores, i.e. the last 20 of the 30 values
    class_pred = coo_pred[:, 10:]
    # do the same for the ground truth so that the two can be compared
    coo_target = target_tensor[coo_mask].view(-1, 30)
    box_target = coo_target[:, :10].contiguous().view(-1, 5)
    class_target = coo_target[:, 10:]
    # likewise, predictions and ground truth of the cells without an object
    noo_pred = pred_tensor[noo_mask].view(-1, 30)
    noo_target = target_tensor[noo_mask].view(-1, 30)

    # confidence loss of the grid cells without an object: part D of the formula
    # 1. build a boolean index that is False everywhere
    noo_pred_mask = torch.zeros_like(noo_pred, dtype=torch.bool)
    # 2. set only the confidence positions of the two boxes to True
    noo_pred_mask[:, 4] = True
    noo_pred_mask[:, 9] = True
    # 3. pick out the corresponding values
    noo_pred_c = noo_pred[noo_pred_mask]  # only the confidences c contribute to this term
    noo_target_c = noo_target[noo_pred_mask]
    # 4. loss value: sum of squared errors
    nooobj_loss = F.mse_loss(noo_pred_c, noo_target_c, reduction='sum')

    # loss of the cells that contain an object
    # boolean masks that start out all False and are filled in below
    coo_response_mask = torch.zeros_like(box_target, dtype=torch.bool) # boxes responsible for the prediction
    coo_not_response_mask = torch.zeros_like(box_target, dtype=torch.bool) # boxes not responsible (a cell predicts two boxes, and only the one with the larger IoU is responsible)
    box_target_iou = torch.zeros_like(box_target) # stores the IoU values
    # every cell has two predicted boxes, hence step=2
    for i in range(0, box_target.size()[0], 2):  # choose the best iou box
        # the two predicted boxes of this cell
        box1 = box_pred[i:i + 2] # [x,y,w,h,c]
        # temporary tensor for the top-left + bottom-right corners, which the IoU needs
        box1_xyxy = torch.zeros_like(box1)
        # convert center + width/height to corners, in normalized coordinates
        box1_xyxy[:, :2] = box1[:, :2] / float(self.S) - 0.5 * box1[:, 2:4] # (xc,yc) is relative to the 7*7 grid, so divide by 7
        box1_xyxy[:, 2:4] = box1[:, :2] / float(self.S) + 0.5 * box1[:, 2:4]
        # same for the ground truth, except that an object has only one ground-truth box
        box2 = box_target[i].view(-1, 5)
        box2_xyxy = torch.zeros_like(box2)
        box2_xyxy[:, :2] = box2[:, :2] / float(self.S) - 0.5 * box2[:, 2:4]
        box2_xyxy[:, 2:4] = box2[:, :2] / float(self.S) + 0.5 * box2[:, 2:4]
        # IoU between the two
        iou = self.compute_iou(box1_xyxy[:, :4], box2_xyxy[:, :4])  # shapes [2,4] and [1,4]
        # value and index of the larger IoU; of the two boxes in a cell, the one with the larger IoU is responsible
        max_iou, max_index = iou.max(0)
        max_index = int(max_index.item())
        # mark the box with the larger IoU as responsible
        coo_response_mask[i + max_index] = True
        # mark the other box as not responsible
        coo_not_response_mask[i + 1 - max_index] = True
        # store the IoU as the confidence target of the responsible box
        box_target_iou[i + max_index, 4] = max_iou.item()
    # values of the responsible boxes, their IoU targets and the ground-truth boxes
    box_pred_response = box_pred[coo_response_mask].view(-1, 5)
    box_target_response_iou = box_target_iou[coo_response_mask].view(-1, 5)
    box_target_response = box_target[coo_response_mask].view(-1, 5)
    # part C of the formula: confidence loss of the responsible boxes
    contain_loss = F.mse_loss(box_pred_response[:, 4], box_target_response_iou[:, 4], reduction='sum')
    # 1. coordinate loss: parts A and B of the formula
    loc_loss = F.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], reduction='sum') + F.mse_loss(
        torch.sqrt(box_pred_response[:, 2:4]), torch.sqrt(box_target_response[:, 2:4]), reduction='sum')
    # predictions and ground truth of the boxes that are not responsible
    box_pred_not_response = box_pred[coo_not_response_mask].view(-1, 5)
    box_target_not_response = box_target[coo_not_response_mask].view(-1, 5)
    box_target_not_response[:, 4] = 0 # their confidence target is 0
    # 2. confidence loss of the non-responsible boxes, also part C of the formula
    not_contain_loss = F.mse_loss(box_pred_not_response[:, 4], box_target_not_response[:, 4], reduction='sum')
    # 3. class loss: part E of the formula
    class_loss = F.mse_loss(class_pred, class_target, reduction='sum')
    return (self.l_coord * loc_loss +  contain_loss + not_contain_loss + self.l_noobj * nooobj_loss  + class_loss) / N
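
The box conversion inside `forward` is easy to misread, so here is a minimal sketch of it on plain Python tuples. The helper name `to_corners` is hypothetical; it assumes, as the code above does, that `(cx, cy)` are offsets inside a cell and `(w, h)` are normalized by the image size.

```python
def to_corners(box, S=7):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2) the way forward() does.

    Illustrative helper only: cx, cy are divided by S to bring them to the
    same scale as w and h before the half-extents are added or subtracted.
    """
    cx, cy, w, h = box
    return (cx / S - 0.5 * w, cy / S - 0.5 * h,
            cx / S + 0.5 * w, cy / S + 0.5 * h)

x1, y1, x2, y2 = to_corners((0.7, 0.35, 0.2, 0.1))
print(round(x1, 6), round(y1, 6), round(x2, 6), round(y2, 6))
```

The conversion leaves out the top-left corner of the cell itself. That is harmless here because the predicted boxes and the ground-truth box being compared belong to the same cell, and IoU does not change when both boxes are shifted by the same amount.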


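With the pieces above in place, the only thing left is the IoU method. It is simple: IoU is the area of the intersection divided by the area of the union.

A figure to aid understanding:


The code is as follows: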

# IoU helper
def compute_iou(self, box1, box2):
    '''
    :param box1: predicted boxes, usually [2,4]
    :param box2: ground-truth boxes, usually [1,4]
    '''
    # number of boxes on each side
    N = box1.size(0)
    M = box2.size(0)
    # the larger of the two top-left corners
    lt = torch.max(
        box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
        box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
    )
    # the smaller of the two bottom-right corners
    rb = torch.min(
        box1[:, 2:].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
        box2[:, 2:].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
    )
    # width and height of the intersection
    wh = rb - lt  # [N,M,2]
    # a negative width or height means the boxes do not overlap, so set it to 0
    wh[wh < 0] = 0  # clip at 0
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    # areas of the boxes
    # area of box1
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
    # area of box2
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
    area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
    area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]
    # IoU: intersection over union, where the union is the sum of both areas minus the intersection
    iou = inter / (area1 + area2 - inter)
    return iou
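
As a worked example of the same intersection-over-union rule, the sketch below computes IoU for plain tuples and then picks the predicted box with the larger IoU, which is how the responsible box is chosen in the loss. The helper name `iou_xyxy` and the sample boxes are made up for illustration.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2). Illustrative helper."""
    # intersection: larger top-left corner, smaller bottom-right corner, clipped at 0
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # union = sum of both areas minus the part counted twice
    return inter / (area_a + area_b - inter)

target = (0.0, 0.0, 2.0, 2.0)
preds = [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 1.0)]
ious = [iou_xyxy(p, target) for p in preds]
# index of the prediction with the larger IoU, i.e. the "responsible" box
responsible = max(range(len(preds)), key=lambda k: ious[k])
print(ious, responsible)
```

The first prediction overlaps the target in a 1x1 square (IoU 1/7), the second in a 2x1 strip (IoU 0.5), so the second box would be treated as responsible.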

5.3 Complete code

# author: baiCai
# 1. Imports
import torch
import torch.nn as nn
import torch.nn.functional as F

# 2. Loss function class
class Yolo_Loss(nn.Module):
    def __init__(self,S=7, B=2, l_coord=5, l_noobj=0.5):
        '''
        :param S: S from the YOLOv1 paper, the grid size, default 7
        :param B: B from the YOLOv1 paper, the number of boxes predicted per cell, default 2
        :param l_coord: loss hyperparameter (lambda_coord), default 5
        :param l_noobj: loss hyperparameter (lambda_noobj), default 0.5
        '''
        super(Yolo_Loss, self).__init__()
        # store the hyperparameters
        self.S = S
        self.B = B
        self.l_coord = l_coord
        self.l_noobj = l_noobj

    # forward pass
    def forward(self,pred_tensor,target_tensor):
        # batch size
        N = pred_tensor.size()[0]
        # mask of the cells that contain an object, shape [batch,7,7]
        coo_mask = target_tensor[:, :, :, 4] > 0
        # mask of the cells that contain no object, shape [batch,7,7]
        noo_mask = target_tensor[:, :, :, 4] == 0
        # expand both masks to shape [batch,7,7,30]
        coo_mask = coo_mask.unsqueeze(-1).expand_as(target_tensor)
        noo_mask = noo_mask.unsqueeze(-1).expand_as(target_tensor)
        # predictions of the cells (7*7 in total) that contain an object, reshaped to [x,30], where x is the number of such cells
        coo_pred = pred_tensor[coo_mask].view(-1, 30)
        # split the values obtained above
        # 1. boxes [x1,y1,w1,h1,c1], shape [2x,5], because every cell predicts two boxes
        box_pred = coo_pred[:, :10].contiguous().view(-1, 5)
        # 2. class scores, i.e. the last 20 of the 30 values
        class_pred = coo_pred[:, 10:]
        # do the same for the ground truth so that the two can be compared
        coo_target = target_tensor[coo_mask].view(-1, 30)
        box_target = coo_target[:, :10].contiguous().view(-1, 5)
        class_target = coo_target[:, 10:]
        # likewise, predictions and ground truth of the cells without an object
        noo_pred = pred_tensor[noo_mask].view(-1, 30)
        noo_target = target_tensor[noo_mask].view(-1, 30)

        # confidence loss of the grid cells without an object: part D of the formula
        # 1. build a boolean index that is False everywhere
        noo_pred_mask = torch.zeros_like(noo_pred, dtype=torch.bool)
        # 2. set only the confidence positions of the two boxes to True
        noo_pred_mask[:, 4] = True
        noo_pred_mask[:, 9] = True
        # 3. pick out the corresponding values
        noo_pred_c = noo_pred[noo_pred_mask]  # only the confidences c contribute to this term
        noo_target_c = noo_target[noo_pred_mask]
        # 4. loss value: sum of squared errors
        nooobj_loss = F.mse_loss(noo_pred_c, noo_target_c, reduction='sum')

        # loss of the cells that contain an object
        # boolean masks that start out all False and are filled in below
        coo_response_mask = torch.zeros_like(box_target, dtype=torch.bool) # boxes responsible for the prediction
        coo_not_response_mask = torch.zeros_like(box_target, dtype=torch.bool) # boxes not responsible (a cell predicts two boxes, and only the one with the larger IoU is responsible)
        box_target_iou = torch.zeros_like(box_target) # stores the IoU values
        # every cell has two predicted boxes, hence step=2
        for i in range(0, box_target.size()[0], 2):  # choose the best iou box
            # the two predicted boxes of this cell
            box1 = box_pred[i:i + 2] # [x,y,w,h,c]
            # temporary tensor for the top-left + bottom-right corners, which the IoU needs
            box1_xyxy = torch.zeros_like(box1)
            # convert center + width/height to corners, in normalized coordinates
            box1_xyxy[:, :2] = box1[:, :2] / float(self.S) - 0.5 * box1[:, 2:4] # (xc,yc) is relative to the 7*7 grid, so divide by 7
            box1_xyxy[:, 2:4] = box1[:, :2] / float(self.S) + 0.5 * box1[:, 2:4]
            # same for the ground truth, except that an object has only one ground-truth box
            box2 = box_target[i].view(-1, 5)
            box2_xyxy = torch.zeros_like(box2)
            box2_xyxy[:, :2] = box2[:, :2] / float(self.S) - 0.5 * box2[:, 2:4]
            box2_xyxy[:, 2:4] = box2[:, :2] / float(self.S) + 0.5 * box2[:, 2:4]
            # IoU between the two
            iou = self.compute_iou(box1_xyxy[:, :4], box2_xyxy[:, :4])  # shapes [2,4] and [1,4]
            # value and index of the larger IoU; of the two boxes in a cell, the one with the larger IoU is responsible
            max_iou, max_index = iou.max(0)
            max_index = int(max_index.item())
            # mark the box with the larger IoU as responsible
            coo_response_mask[i + max_index] = True
            # mark the other box as not responsible
            coo_not_response_mask[i + 1 - max_index] = True
            # store the IoU as the confidence target of the responsible box
            box_target_iou[i + max_index, 4] = max_iou.item()
        # values of the responsible boxes, their IoU targets and the ground-truth boxes
        box_pred_response = box_pred[coo_response_mask].view(-1, 5)
        box_target_response_iou = box_target_iou[coo_response_mask].view(-1, 5)
        box_target_response = box_target[coo_response_mask].view(-1, 5)
        # part C of the formula: confidence loss of the responsible boxes
        contain_loss = F.mse_loss(box_pred_response[:, 4], box_target_response_iou[:, 4], reduction='sum')
        # 1. coordinate loss: parts A and B of the formula
        loc_loss = F.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], reduction='sum') + F.mse_loss(
            torch.sqrt(box_pred_response[:, 2:4]), torch.sqrt(box_target_response[:, 2:4]), reduction='sum')
        # predictions and ground truth of the boxes that are not responsible
        box_pred_not_response = box_pred[coo_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coo_not_response_mask].view(-1, 5)
        box_target_not_response[:, 4] = 0 # their confidence target is 0
        # 2. confidence loss of the non-responsible boxes, also part C of the formula
        not_contain_loss = F.mse_loss(box_pred_not_response[:, 4], box_target_not_response[:, 4], reduction='sum')
        # 3. class loss: part E of the formula
        class_loss = F.mse_loss(class_pred, class_target, reduction='sum')
        return (self.l_coord * loc_loss +  contain_loss + not_contain_loss + self.l_noobj * nooobj_loss  + class_loss) / N

    # IoU helper
    def compute_iou(self, box1, box2):
        '''
        :param box1: predicted boxes, usually [2,4]
        :param box2: ground-truth boxes, usually [1,4]
        '''
        # number of boxes on each side
        N = box1.size(0)
        M = box2.size(0)
        # the larger of the two top-left corners
        lt = torch.max(
            box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        # the smaller of the two bottom-right corners
        rb = torch.min(
            box1[:, 2:].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, 2:].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        # width and height of the intersection
        wh = rb - lt  # [N,M,2]
        # a negative width or height means the boxes do not overlap, so set it to 0
        wh[wh < 0] = 0  # clip at 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        # areas of the boxes
        # area of box1
        area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
        # area of box2
        area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]
        # IoU: intersection over union, where the union is the sum of both areas minus the intersection
        iou = inter / (area1 + area2 - inter)
        return iou

6. Training


This file is named train.py and lives in the main directory.

6.1 Imports

# 1. Imports
import warnings
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import torchvision.transforms as T
from torchvision import models

from network.My_ResNet import resnet50
from utils.Yolo_Loss import Yolo_Loss
from utils.My_Dataset import Yolo_Dataset

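6.2 Implementing the training code


# 2. Basic parameters
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
batch_size = 2      # set according to your own hardware
epochs = 50
lr = 0.01
file_root = '../data/VOC2012/JPEGImages/'   # change this to your actual path


# 3. Create the model and inherit the pretrained weights
pytorch_resnet = models.resnet50(pretrained=True)   # official pretrained resnet50
model = resnet50()      # our own resnet50
# let our model inherit the official weights
pytorch_state_dict = pytorch_resnet.state_dict()
model_state_dict = model.state_dict()
for k in pytorch_state_dict.keys():
    # debugging: print the keys to check that the layers line up
    # print(k)
    # inherit a weight if both models share the key and it does not belong to the fc layer
    if k in model_state_dict.keys() and not k.startswith('fc'):
        model_state_dict[k] = pytorch_state_dict[k]
# load the merged weights back into the model; editing the dict alone does not change the model
model.load_state_dict(model_state_dict)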
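
The key-filtering rule used to inherit the pretrained weights can be tried out on plain dictionaries. This is only a sketch: the helper `inherit_weights` and the key `conv_end.weight` are hypothetical, and strings stand in for weight tensors.

```python
def inherit_weights(pretrained, own, skip_prefix='fc'):
    """Return a copy of `own` in which every key shared with `pretrained`
    (except keys starting with `skip_prefix`) takes the pretrained value.
    Plain dicts stand in for PyTorch state_dicts in this illustration."""
    merged = dict(own)
    for k, v in pretrained.items():
        if k in merged and not k.startswith(skip_prefix):
            merged[k] = v
    return merged

pretrained = {'conv1.weight': 'P1', 'layer1.0.conv1.weight': 'P2', 'fc.weight': 'P3'}
own = {'conv1.weight': 'O1', 'layer1.0.conv1.weight': 'O2',
       'fc.weight': 'O3', 'conv_end.weight': 'O4'}
merged = inherit_weights(pretrained, own)
print(merged)
```

Shared backbone keys take the pretrained values, while the `fc` entry and any layer that exists only in the custom model keep their own initialization. With a real model the merged dictionary still has to be passed to `model.load_state_dict(...)` before it has any effect.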


# 4. Loss function and optimizer; move the model and the loss function to the GPU
loss = Yolo_Loss()
model.to(device)
loss.to(device)
optimizer = torch.optim.SGD(model.parameters(),lr=lr,momentum=0.9,weight_decay=5e-4)


# 5. Load the data
train_dataset = Yolo_Dataset(root=file_root,list_file='./utils/voctrain.txt',train=True,transforms = [T.ToTensor()])
train_loader = DataLoader(train_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
test_dataset = Yolo_Dataset(root=file_root,list_file='./utils/voctest.txt',train=False,transforms = [T.ToTensor()])
test_loader = DataLoader(test_dataset,batch_size=batch_size,shuffle=True,drop_last=True)


# 6. Training
# print some basic information
print('starting train the model')
print('the train_dataset has %d images' % len(train_dataset))
print('the batch_size is ',batch_size)
# best validation loss so far
best_test_loss = float('inf')
# start training
for e in range(epochs):
    model.train()
    # adjust the learning rate
    if e == 20:
        print('change the lr')
        optimizer.param_groups[0]['lr'] /= 10
    if e == 35:
        print('change the lr')
        optimizer.param_groups[0]['lr'] /= 10
    # progress bar
    tqdm_tarin = tqdm(train_loader)
    # running loss
    total_loss = 0.
    for i,(images,target) in enumerate(tqdm_tarin):
        # move the tensors to the device
        images,target = images.to(device),target.to(device)
        # forward pass and loss
        pred = model(images)
        loss_value = loss(pred,target)
        total_loss += loss_value.item()
        # backward pass and parameter update
        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()
        # show the running average loss
        if (i+1) % 5 == 0:
            tqdm_tarin.desc = 'train epoch[{}/{}] loss:{:.6f}'.format(e+1,epochs,total_loss/(i+1))
    # switch to evaluation mode
    model.eval()
    validation_loss = 0.0
    tqdm_test = tqdm(test_loader)
    with torch.no_grad():
        for i, (images, target) in enumerate(tqdm_test):
            images, target = images.to(device), target.to(device)
            pred = model(images)
            loss_value = loss(pred, target)
            validation_loss += loss_value.item()
    validation_loss /= len(test_loader)
    # show the validation loss
    print('In the test step,the average loss is %.6f' % validation_loss)
    # if the best loss so far is larger than the validation loss, this epoch is the best one yet
    # enabling the block below to keep the best weights is optional
    # if best_test_loss > validation_loss:
    #     best_test_loss = validation_loss
    #     print('get best test loss %.5f' % best_test_loss)
    #     torch.save(model.state_dict(), './save_weights/best.pth')
    # remember to save the weights at the end
    torch.save(model.state_dict(), './save_weights/yolo.pth')
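
The learning-rate rule in the training loop (divide by 10 at epoch 20 and again at epoch 35) can be written as a small pure function, which makes the schedule easy to check without running any training. The name `lr_at_epoch` is an illustrative assumption.

```python
def lr_at_epoch(epoch, base_lr=0.01, milestones=(20, 35), factor=10.0):
    """Learning rate in effect during `epoch` (0-based) when the rate is
    divided by `factor` each time a milestone epoch is reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= factor
    return lr

for e in (0, 19, 20, 34, 35, 49):
    print(e, lr_at_epoch(e))
```

With the defaults from this article the rate is 0.01 for epochs 0-19, about 0.001 for epochs 20-34 and about 0.0001 from epoch 35 on. PyTorch's built-in `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 35], gamma=0.1)`, stepped once per epoch, is one way to get the same behaviour without editing `param_groups` by hand.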

6.3 Training process

As shown in the animation below:


However, my GPU is really weak, so even a single epoch trains very slowly.

6.4 Complete code

# author: baiCai
# 1. Imports
import warnings
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import torchvision.transforms as T
from torchvision import models

from network.My_ResNet import resnet50
from utils.Yolo_Loss import Yolo_Loss
from utils.My_Dataset import Yolo_Dataset

# 2. Basic parameters
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
batch_size = 2      # set according to your own hardware
epochs = 50
lr = 0.01
file_root = '../data/VOC2012/JPEGImages/'   # change this to your actual path
# 3. Create the model and inherit the pretrained weights
pytorch_resnet = models.resnet50(pretrained=True)   # official pretrained resnet50
model = resnet50()      # our own resnet50
# let our model inherit the official weights
pytorch_state_dict = pytorch_resnet.state_dict()
model_state_dict = model.state_dict()
for k in pytorch_state_dict.keys():
    # debugging: print the keys to check that the layers line up
    # print(k)
    # inherit a weight if both models share the key and it does not belong to the fc layer
    if k in model_state_dict.keys() and not k.startswith('fc'):
        model_state_dict[k] = pytorch_state_dict[k]
# load the merged weights back into the model; editing the dict alone does not change the model
model.load_state_dict(model_state_dict)
# 4. Loss function and optimizer; move the model and the loss function to the GPU
loss = Yolo_Loss()
model.to(device)
loss.to(device)
optimizer = torch.optim.SGD(model.parameters(),lr=lr,momentum=0.9,weight_decay=5e-4)
# 5. Load the data
train_dataset = Yolo_Dataset(root=file_root,list_file='./utils/voctrain.txt',train=True,transforms = [T.ToTensor()])
train_loader = DataLoader(train_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
test_dataset = Yolo_Dataset(root=file_root,list_file='./utils/voctest.txt',train=False,transforms = [T.ToTensor()])
test_loader = DataLoader(test_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
# 6. Training
# print some basic information
print('starting train the model')
print('the train_dataset has %d images' % len(train_dataset))
print('the batch_size is ',batch_size)
# best validation loss so far
best_test_loss = float('inf')
# start training
for e in range(epochs):
    model.train()
    # adjust the learning rate
    if e == 20:
        print('change the lr')
        optimizer.param_groups[0]['lr'] /= 10
    if e == 35:
        print('change the lr')
        optimizer.param_groups[0]['lr'] /= 10
    # progress bar
    tqdm_tarin = tqdm(train_loader)
    # running loss
    total_loss = 0.
    for i,(images,target) in enumerate(tqdm_tarin):
        # move the tensors to the device
        images,target = images.to(device),target.to(device)
        # forward pass and loss
        pred = model(images)
        loss_value = loss(pred,target)
        total_loss += loss_value.item()
        # backward pass and parameter update
        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()
        # show the running average loss
        if (i+1) % 5 == 0:
            tqdm_tarin.desc = 'train epoch[{}/{}] loss:{:.6f}'.format(e+1,epochs,total_loss/(i+1))
    # switch to evaluation mode
    model.eval()
    validation_loss = 0.0
    tqdm_test = tqdm(test_loader)
    with torch.no_grad():
        for i, (images, target) in enumerate(tqdm_test):
            images, target = images.to(device), target.to(device)
            pred = model(images)
            loss_value = loss(pred, target)
            validation_loss += loss_value.item()
    validation_loss /= len(test_loader)
    # show the validation loss
    print('In the test step,the average loss is %.6f' % validation_loss)
    # if the best loss so far is larger than the validation loss, this epoch is the best one yet
    # enabling the block below to keep the best weights is optional
    # if best_test_loss > validation_loss:
    #     best_test_loss = validation_loss
    #     print('get best test loss %.5f' % best_test_loss)
    #     torch.save(model.state_dict(), './save_weights/best.pth')
    # remember to save the weights at the end
    torch.save(model.state_dict(), './save_weights/yolo.pth')

7. Testing


This file is named predict.py and lives in the main directory.

7.1 Imports

# 1. Imports
import os
import random
import torch
from torch.autograd import Variable
import torchvision.transforms as transforms
import cv2
from matplotlib import pyplot as plt
import numpy as np
import warnings

from network.My_ResNet import resnet50

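7.2 Implementing the prediction code


First come the class names and the colors used when drawing:

# 2. Basic parameters
# class names (the assignment line is missing from this copy; the name VOC_CLASSES is assumed)
VOC_CLASSES = (
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')
# colors used when drawing the rectangles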
Color = [[0, 0, 0],
                    [128, 0, 0],
                    [0, 128, 0],
                    [128, 128, 0],
                    [0, 0, 128],
                    [128, 0, 128],
                    [0, 128, 128],
                    [128, 128, 128],
                    [64, 0, 0],
                    [192, 0, 0],
                    [64, 128, 0],
                    [192, 128, 0],
                    [64, 0, 128],
                    [192, 0, 128],
                    [64, 128, 128],
                    [192, 128, 128],
                    [0, 64, 0],
                    [128, 64, 0],
                    [0, 192, 0],
                    [128, 192, 0],
                    [0, 64, 128]]


The decoder function is the key part of the prediction code. It is implemented as follows:

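# 3. Decoding function
def decoder(pred):
    '''
    :param pred: batch x 7 x 7 x 30; at prediction time images are usually fed one at a time, so batch=1
    :return: boxes [[x1,y1,x2,y2]], class indices [...], probabilities [...]
    '''
    # Define some basic parameters
    grid_num = 7        # the image is divided into grid_num x grid_num cells
    boxes = []
    cls_indexs = []
    probs = []
    cell_size = 1./grid_num # size of one cell in normalized coordinates
    # Extract the values we need
    pred = pred.data    # prediction data: 1*7*7*30
    pred = pred.squeeze(0) # prediction data: 7x7x30
    contain1 = pred[:,:,4].unsqueeze(2)  # confidence of the first box, unsqueezed to 7*7*1
    contain2 = pred[:,:,9].unsqueeze(2) # same, for the second box
    contain = torch.cat((contain1,contain2),2) # concatenated: 7*7*2
    mask1 = contain > 0.1 # True where the confidence exceeds the 0.1 threshold
    mask2 = (contain==contain.max()) # True for the box with the highest confidence
    mask = mask1 | mask2 # keep a box if either condition holds
    # Iterate over every cell, 7*7 in total
    for i in range(grid_num):
        for j in range(grid_num):
            # Iterate over the two predicted boxes
            for b in range(2):
                # mask is True when the box passes the threshold or has the highest confidence
                if mask[i,j,b] == 1:
                    # Box coordinates [cx,cy,w,h]
                    box = pred[i,j,b*5:b*5+4]
                    # Box confidence
                    contain_prob = torch.FloatTensor([pred[i,j,b*5+4]])
                    # Convert the cell-relative center into image-relative coordinates
                    xy = torch.FloatTensor([j,i])*cell_size # top-left corner of the cell
                    box[:2] = box[:2]*cell_size + xy
                    # Convert [cx,cy,w,h] to [x1,y1,x2,y2]
                    box_xy = torch.FloatTensor(box.size())      # new tensor to hold the result
                    box_xy[:2] = box[:2] - 0.5*box[2:] # center minus half the width/height: top-left corner
                    box_xy[2:] = box[:2] + 0.5*box[2:] # center plus half the width/height: bottom-right corner
                    # Highest class probability and its class index
                    max_prob,cls_index = torch.max(pred[i,j,10:],0)
                    # If confidence * class probability > 0.1, the box is reasonably credible,
                    # so add its values to the result lists
                    if float((contain_prob*max_prob)[0]) > 0.1:
                        boxes.append(box_xy.view(1,4))
                        # reshape to (1,) so the entries can be concatenated below
                        cls_indexs.append(cls_index.view(1))
                        probs.append(contain_prob*max_prob)
    # No box survived: return a single all-zero placeholder
    if len(boxes) ==0:
        boxes = torch.zeros((1,4))
        probs = torch.zeros(1)
        cls_indexs = torch.zeros(1)
    # Otherwise convert the lists [tensor,tensor] into single tensors;
    # the values are unchanged
    else:
        boxes = torch.cat(boxes,0) #(n,4)
        probs = torch.cat(probs,0) #(n,)
        cls_indexs = torch.cat(cls_indexs,0) #(n,)
    # Post-processing: NMS
    keep = nms(boxes,probs)
    # Return the values
    return boxes[keep],cls_indexs[keep],probs[keep]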
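

To make the coordinate conversion inside decoder more concrete, here is a minimal pure-Python sketch for a single cell. The helper name cell_box_to_corners and its argument names are made up for illustration. It assumes, as the code above does, that (cx, cy) is an offset inside the cell at row i and column j, and that (w, h) is already relative to the whole image.

```python
def cell_box_to_corners(i, j, cx, cy, w, h, grid_num=7):
    """Convert one cell-relative prediction to normalized [x1, y1, x2, y2].

    i, j   : row and column of the grid cell (hypothetical argument names)
    cx, cy : box center as an offset inside the cell, in [0, 1]
    w, h   : box width and height relative to the whole image
    """
    cell_size = 1.0 / grid_num
    # Center relative to the whole image: column j drives x, row i drives y
    center_x = cx * cell_size + j * cell_size
    center_y = cy * cell_size + i * cell_size
    # Corners are the center minus/plus half of the width/height
    return [center_x - 0.5 * w, center_y - 0.5 * h,
            center_x + 0.5 * w, center_y + 0.5 * h]


# A box centered in the middle cell (row 3, column 3) of a 7x7 grid
corners = cell_box_to_corners(3, 3, cx=0.5, cy=0.5, w=0.2, h=0.4)
print([round(v, 3) for v in corners])
```

The middle cell with a centered offset lands at the image center (0.5, 0.5), so the corners come out at roughly 0.4, 0.3, 0.6 and 0.7.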


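As the code above shows, we still need to implement NMS. It filters the predicted boxes once more and keeps the ones of relatively higher quality.

The implementation is as follows: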

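# 4. NMS
def nms(bboxes,scores,threshold=0.5):
    '''
    :param bboxes:  bboxes(tensor) [N,4]
    :param scores:  scores(tensor) [N,]
    :param threshold: IoU threshold
    :return: indices of the boxes kept after filtering
    '''
    # Coordinates of each box
    x1 = bboxes[:,0]
    y1 = bboxes[:,1]
    x2 = bboxes[:,2]
    y2 = bboxes[:,3]
    # Areas
    areas = (x2-x1) * (y2-y1)
    # Sort the scores in descending order; order holds the original indices in sorted order
    _,order = scores.sort(0,descending=True)
    keep = []
    # Loop while there are candidates left in order
    while order.numel() > 0:
        # Only one candidate left: keep it and end the loop
        if order.numel() == 1:
            i = order.item()
            keep.append(i)
            break
        # Index of the box with the highest score; it is always kept
        i = order[0].item()
        keep.append(i)
        # Clamp the corners of the remaining boxes against the selected box
        xx1 = x1[order[1:]].clamp(min=x1[i].item()) # min: values below it are raised to it, larger values are untouched
        yy1 = y1[order[1:]].clamp(min=y1[i].item())
        xx2 = x2[order[1:]].clamp(max=x2[i].item())
        yy2 = y2[order[1:]].clamp(max=y2[i].item())
        # xx1, yy1, ... exclude the selected box: if x1 has three elements, xx1 has only two
        # Width, height and area of the intersection; negative values are set to 0
        w = (xx2-xx1).clamp(min=0)
        h = (yy2-yy1).clamp(min=0)
        inter = w*h

        # Prepare to update order:
        # IoU between the selected box and each remaining box
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Boxes whose IoU is at most the threshold stay in the running (the NMS principle)
        ids = (ovr<=threshold).nonzero().squeeze()
        if ids.numel() == 0:
            break
        order = order[ids+1]
    return torch.LongTensor(keep)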
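

As a cross-check for the tensor version above, the same greedy procedure can be sketched on plain Python lists. This is only an illustrative reference, not the code used in this project. The names iou and nms_reference and the sample boxes are made up, and the boxes are assumed to have non-zero area.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_reference(boxes, scores, threshold=0.5):
    """Greedy NMS on plain lists; returns the indices of the kept boxes."""
    # Indices sorted by score, highest first
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score is always kept
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much
        order = [k for k in order if iou(boxes[best], boxes[k]) <= threshold]
    return keep


boxes = [[0.10, 0.10, 0.50, 0.50],   # two heavily overlapping boxes ...
         [0.12, 0.12, 0.52, 0.52],
         [0.60, 0.60, 0.90, 0.90]]   # ... and one far away
scores = [0.9, 0.8, 0.7]
print(nms_reference(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap at all and survives. If torchvision is available, torchvision.ops.nms offers a ready-made implementation of the same idea.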


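This step is a fairly routine flow: open the image, run the prediction, process the prediction results, and return the processed results.

The code is as follows: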

# 5. Prediction function
def predict_single(model, image_name, root_path=''):
    result = []  # holds the results
    # Read the image
    image = cv2.imread(root_path + image_name)
    h, w, _ = image.shape
    # Resize to the model input size, 448*448
    img = cv2.resize(image, (448, 448))
    # The model was defined with RGB input, so convert from OpenCV's BGR here
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    mean = (123, 117, 104)  # RGB mean
    img = img - np.array(mean, dtype=np.float32)
    # Preprocessing
    transform = transforms.Compose([transforms.ToTensor(), ])
    img = transform(img)
    # Add the batch dimension and move the input to the GPU
    img = img[None, :, :, :].cuda()
    # Run the prediction without tracking gradients
    with torch.no_grad():
        pred = model(img)  # 1x7x7x30
    pred = pred.cpu()
    # Decode
    boxes, cls_indexs, probs = decoder(pred)
    # Iterate over every box
    for i, box in enumerate(boxes):
        # Map the normalized coordinates back to pixel coordinates
        x1 = int(box[0] * w)
        x2 = int(box[2] * w)
        y1 = int(box[1] * h)
        y2 = int(box[3] * h)
        # Class index and probability
        cls_index = cls_indexs[i]
        cls_index = int(cls_index)  # convert LongTensor to int
        prob = probs[i]
        prob = float(prob)
        # Collect everything into one entry of the result
        result.append([(x1, y1), (x2, y2), VOC_CLASSES[cls_index], image_name, prob])
    return result
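

The last step of predict_single scales the normalized corners by the original width and height. Predicted boxes near the border can extend slightly past the image, so one optional refinement is to clamp the pixel coordinates. The sketch below is an assumption on my part and not part of the original code; the helper name to_pixel_box is hypothetical.

```python
def to_pixel_box(box, img_w, img_h):
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel corners,
    clamped so that both corners stay inside the image."""
    x1, y1, x2, y2 = box

    def clip(value, size):
        # Keep the pixel coordinate inside [0, size - 1]
        return max(0, min(int(value), size - 1))

    top_left = (clip(x1 * img_w, img_w), clip(y1 * img_h, img_h))
    bottom_right = (clip(x2 * img_w, img_w), clip(y2 * img_h, img_h))
    return top_left, bottom_right


# The bottom edge (1.5 * 200 = 300) lies outside a 200-pixel-high image
print(to_pixel_box([0.25, 0.5, 0.75, 1.5], img_w=400, img_h=200))
# ((100, 100), (300, 199))
```

The two tuples have the same shape as the (x1, y1) and (x2, y2) entries that predict_single appends to its result, so they could be passed to cv2.rectangle in the same way.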


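With all the functions above in place, we can write the calling code. The tricky part of this step is drawing the rectangles and the text on the image, though it is not really hard once you are familiar with the drawing functions in cv2.

The code is as follows: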

if __name__ == '__main__':
    # Used to slow down the display
    import time
    # Create the model and load the parameters
    model = resnet50()
    # Load the trained weights here with model.load_state_dict(torch.load(...)),
    # pointing at wherever the weight file was saved
    model.eval()
    model.cuda()
    # Set the image path
    # base_path = './test_images/'
    base_path = '../data/VOC2012/JPEGImages/'
    image_name_list = [base_path+i for i in os.listdir(base_path)]
    # Shuffle the order
    random.shuffle(image_name_list)
    print('start predicting....')
    for image_name in image_name_list:
        image = cv2.imread(image_name)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        result = predict_single(model, image_name)
        # Draw the rectangles and the corresponding class information
        for left_up, right_bottom, class_name, _, prob in result:
            # Color for this class
            color = Color[VOC_CLASSES.index(class_name)]
            # Draw the rectangle
            cv2.rectangle(image, left_up, right_bottom, color, 2)
            # Class name and probability, as a str
            label = class_name + str(round(prob, 2))
            # Write the class and probability on top of a filled background rectangle
            text_size, baseline = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.4, 1)
            p1 = (left_up[0], left_up[1] - text_size[1])
            cv2.rectangle(image, (p1[0] - 2 // 2, p1[1] - 2 - baseline), (p1[0] + text_size[0], p1[1] + text_size[1]),
                          color, -1)
            cv2.putText(image, label, (p1[0], p1[1] + baseline), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 255, 255), 1, 8)

        # Show the image
        plt.imshow(image)
        plt.show()
        # Optionally save the result image
        # cv2.imwrite('./test_images/result.jpg', image)

7.3 Prediction results


7.4 Complete code

# author: baiCai
# 1. Imports
import os
import random
import torch
from torch.autograd import Variable
import torchvision.transforms as transforms
import cv2
from matplotlib import pyplot as plt
import numpy as np
import warnings

from network.My_ResNet import resnet50

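# 2. Define some basic parameters
# Class names; the position in the tuple is the class index
VOC_CLASSES = (
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')
# Colors used when drawing the bounding boxes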
Color = [[0, 0, 0],
                    [128, 0, 0],
                    [0, 128, 0],
                    [128, 128, 0],
                    [0, 0, 128],
                    [128, 0, 128],
                    [0, 128, 128],
                    [128, 128, 128],
                    [64, 0, 0],
                    [192, 0, 0],
                    [64, 128, 0],
                    [192, 128, 0],
                    [64, 0, 128],
                    [192, 0, 128],
                    [64, 128, 128],
                    [192, 128, 128],
                    [0, 64, 0],
                    [128, 64, 0],
                    [0, 192, 0],
                    [128, 192, 0],
                    [0, 64, 128]]

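# 3. Decoding function
def decoder(pred):
    '''
    :param pred: batch x 7 x 7 x 30; at prediction time images are usually fed one at a time, so batch=1
    :return: boxes [[x1,y1,x2,y2]], class indices [...], probabilities [...]
    '''
    # Define some basic parameters
    grid_num = 7        # the image is divided into grid_num x grid_num cells
    boxes = []
    cls_indexs = []
    probs = []
    cell_size = 1./grid_num # size of one cell in normalized coordinates
    # Extract the values we need
    pred = pred.data    # prediction data: 1*7*7*30
    pred = pred.squeeze(0) # prediction data: 7x7x30
    contain1 = pred[:,:,4].unsqueeze(2)  # confidence of the first box, unsqueezed to 7*7*1
    contain2 = pred[:,:,9].unsqueeze(2) # same, for the second box
    contain = torch.cat((contain1,contain2),2) # concatenated: 7*7*2
    mask1 = contain > 0.1 # True where the confidence exceeds the 0.1 threshold
    mask2 = (contain==contain.max()) # True for the box with the highest confidence
    mask = mask1 | mask2 # keep a box if either condition holds
    # Iterate over every cell, 7*7 in total
    for i in range(grid_num):
        for j in range(grid_num):
            # Iterate over the two predicted boxes
            for b in range(2):
                # mask is True when the box passes the threshold or has the highest confidence
                if mask[i,j,b] == 1:
                    # Box coordinates [cx,cy,w,h]
                    box = pred[i,j,b*5:b*5+4]
                    # Box confidence
                    contain_prob = torch.FloatTensor([pred[i,j,b*5+4]])
                    # Convert the cell-relative center into image-relative coordinates
                    xy = torch.FloatTensor([j,i])*cell_size # top-left corner of the cell
                    box[:2] = box[:2]*cell_size + xy
                    # Convert [cx,cy,w,h] to [x1,y1,x2,y2]
                    box_xy = torch.FloatTensor(box.size())      # new tensor to hold the result
                    box_xy[:2] = box[:2] - 0.5*box[2:] # center minus half the width/height: top-left corner
                    box_xy[2:] = box[:2] + 0.5*box[2:] # center plus half the width/height: bottom-right corner
                    # Highest class probability and its class index
                    max_prob,cls_index = torch.max(pred[i,j,10:],0)
                    # If confidence * class probability > 0.1, the box is reasonably credible,
                    # so add its values to the result lists
                    if float((contain_prob*max_prob)[0]) > 0.1:
                        boxes.append(box_xy.view(1,4))
                        # reshape to (1,) so the entries can be concatenated below
                        cls_indexs.append(cls_index.view(1))
                        probs.append(contain_prob*max_prob)
    # No box survived: return a single all-zero placeholder
    if len(boxes) ==0:
        boxes = torch.zeros((1,4))
        probs = torch.zeros(1)
        cls_indexs = torch.zeros(1)
    # Otherwise convert the lists [tensor,tensor] into single tensors;
    # the values are unchanged
    else:
        boxes = torch.cat(boxes,0) #(n,4)
        probs = torch.cat(probs,0) #(n,)
        cls_indexs = torch.cat(cls_indexs,0) #(n,)
    # Post-processing: NMS
    keep = nms(boxes,probs)
    # Return the values
    return boxes[keep],cls_indexs[keep],probs[keep]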

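# 4. NMS
def nms(bboxes,scores,threshold=0.5):
    '''
    :param bboxes:  bboxes(tensor) [N,4]
    :param scores:  scores(tensor) [N,]
    :param threshold: IoU threshold
    :return: indices of the boxes kept after filtering
    '''
    # Coordinates of each box
    x1 = bboxes[:,0]
    y1 = bboxes[:,1]
    x2 = bboxes[:,2]
    y2 = bboxes[:,3]
    # Areas
    areas = (x2-x1) * (y2-y1)
    # Sort the scores in descending order; order holds the original indices in sorted order
    _,order = scores.sort(0,descending=True)
    keep = []
    # Loop while there are candidates left in order
    while order.numel() > 0:
        # Only one candidate left: keep it and end the loop
        if order.numel() == 1:
            i = order.item()
            keep.append(i)
            break
        # Index of the box with the highest score; it is always kept
        i = order[0].item()
        keep.append(i)
        # Clamp the corners of the remaining boxes against the selected box
        xx1 = x1[order[1:]].clamp(min=x1[i].item()) # min: values below it are raised to it, larger values are untouched
        yy1 = y1[order[1:]].clamp(min=y1[i].item())
        xx2 = x2[order[1:]].clamp(max=x2[i].item())
        yy2 = y2[order[1:]].clamp(max=y2[i].item())
        # xx1, yy1, ... exclude the selected box: if x1 has three elements, xx1 has only two
        # Width, height and area of the intersection; negative values are set to 0
        w = (xx2-xx1).clamp(min=0)
        h = (yy2-yy1).clamp(min=0)
        inter = w*h

        # Prepare to update order:
        # IoU between the selected box and each remaining box
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Boxes whose IoU is at most the threshold stay in the running (the NMS principle)
        ids = (ovr<=threshold).nonzero().squeeze()
        if ids.numel() == 0:
            break
        order = order[ids+1]
    return torch.LongTensor(keep)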

# 5. Prediction function
def predict_single(model, image_name, root_path=''):
    result = []  # holds the results
    # Read the image
    image = cv2.imread(root_path + image_name)
    h, w, _ = image.shape
    # Resize to the model input size, 448*448
    img = cv2.resize(image, (448, 448))
    # The model was defined with RGB input, so convert from OpenCV's BGR here
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    mean = (123, 117, 104)  # RGB mean
    img = img - np.array(mean, dtype=np.float32)
    # Preprocessing
    transform = transforms.Compose([transforms.ToTensor(), ])
    img = transform(img)
    # Add the batch dimension and move the input to the GPU
    img = img[None, :, :, :].cuda()
    # Run the prediction without tracking gradients
    with torch.no_grad():
        pred = model(img)  # 1x7x7x30
    pred = pred.cpu()
    # Decode
    boxes, cls_indexs, probs = decoder(pred)
    # Iterate over every box
    for i, box in enumerate(boxes):
        # Map the normalized coordinates back to pixel coordinates
        x1 = int(box[0] * w)
        x2 = int(box[2] * w)
        y1 = int(box[1] * h)
        y2 = int(box[3] * h)
        # Class index and probability
        cls_index = cls_indexs[i]
        cls_index = int(cls_index)  # convert LongTensor to int
        prob = probs[i]
        prob = float(prob)
        # Collect everything into one entry of the result
        result.append([(x1, y1), (x2, y2), VOC_CLASSES[cls_index], image_name, prob])
    return result

if __name__ == '__main__':
    # Used to slow down the display
    import time
    # Create the model and load the parameters
    model = resnet50()
    # Load the trained weights here with model.load_state_dict(torch.load(...)),
    # pointing at wherever the weight file was saved
    model.eval()
    model.cuda()
    # Set the image path
    # base_path = './test_images/'
    base_path = '../data/VOC2012/JPEGImages/'
    image_name_list = [base_path+i for i in os.listdir(base_path)]
    # Shuffle the order
    random.shuffle(image_name_list)
    print('start predicting....')
    for image_name in image_name_list:
        image = cv2.imread(image_name)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        result = predict_single(model, image_name)
        # Draw the rectangles and the corresponding class information
        for left_up, right_bottom, class_name, _, prob in result:
            # Color for this class
            color = Color[VOC_CLASSES.index(class_name)]
            # Draw the rectangle
            cv2.rectangle(image, left_up, right_bottom, color, 2)
            # Class name and probability, as a str
            label = class_name + str(round(prob, 2))
            # Write the class and probability on top of a filled background rectangle
            text_size, baseline = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.4, 1)
            p1 = (left_up[0], left_up[1] - text_size[1])
            cv2.rectangle(image, (p1[0] - 2 // 2, p1[1] - 2 - baseline), (p1[0] + text_size[0], p1[1] + text_size[1]),
                          color, -1)
            cv2.putText(image, label, (p1[0], p1[1] + baseline), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 255, 255), 1, 8)

        # Show the image
        plt.imshow(image)
        plt.show()
        # Optionally save the result image
        # cv2.imwrite('./test_images/result.jpg', image)

8. Code and weights download

I put the code and the weight file on Baidu Netdisk; help yourself if you need them:


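9. Summary

When I wrote classification tasks before, my computer never felt strained; even the longer runs finished in an hour or two. This time, YOLOv1 took a whole day to train. Toward the end I kept worrying that a bug in the code or a crash would waste it all, which was a reminder for me: shouldn't the weight file be saved at regular intervals?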
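
One possible answer to that question is to decide after every epoch whether the weights should be written to disk. The sketch below is only an illustration and not part of the training code in this article. The class name CheckpointPolicy, the interval of 5 epochs and the loss values are all made up.

```python
class CheckpointPolicy:
    """Decide after each epoch whether the weights should be saved.

    Hypothetical helper: save every `every` epochs, and also whenever the
    validation loss reaches a new minimum.
    """

    def __init__(self, every=5):
        self.every = every
        self.best_loss = float('inf')

    def should_save(self, epoch, val_loss):
        improved = val_loss < self.best_loss
        if improved:
            self.best_loss = val_loss
        # epoch is 0-based, so epoch 4 is the 5th epoch
        periodic = (epoch + 1) % self.every == 0
        return improved or periodic


policy = CheckpointPolicy(every=5)
losses = [3.0, 2.5, 2.7, 2.6, 2.8, 2.4]  # made-up validation losses
saved = [epoch for epoch, loss in enumerate(losses)
         if policy.should_save(epoch, loss)]
print(saved)  # [0, 1, 4, 5]
```

Inside a training loop, a True result would be followed by a call such as torch.save(model.state_dict(), path), where path is whatever location you choose. A crash late in training then costs at most a few epochs.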

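Besides that, YOLOv1 is one of the simpler object detection algorithms, yet implementing it is still fairly tedious. Without reference material, building it from scratch would probably be even harder.

Finally, a note on the prediction results of this code. They are not particularly good, and I think the following points could be improved:

  • First, the batch_size. If I had the choice, I would make batch_size larger.
  • Second, the choice of optimizer and its parameters. I used SGD with the default parameters, and I wonder whether Adam would work better.
  • Also, the learning rate starts at 0.01 and is divided by 10 at epochs 20 and 35. The initial learning rate feels a bit large; could it be reduced?
  • Lastly, fine-tuning. We inherit the ResNet50 weights, but that ResNet50 was trained at a resolution of 224*224, whereas our object detector uses 448*448. Whether the backbone should first be fine-tuned to some extent before training YOLOv1 is worth thinking about.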
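
The learning rate schedule mentioned in the list can be written down as a small function, which makes it easy to try a smaller starting value. This is a sketch under assumptions: the function name step_lr is made up, and epochs are assumed to be counted from 0, which may differ from how the training script counts them. PyTorch's torch.optim.lr_scheduler.MultiStepLR implements the same kind of step decay.

```python
from bisect import bisect_right


def step_lr(epoch, base_lr=0.01, milestones=(20, 35), gamma=0.1):
    """Learning rate for a 0-based epoch under a step schedule.

    The rate is multiplied by gamma once for every milestone that has
    already been reached.
    """
    return base_lr * gamma ** bisect_right(milestones, epoch)


# Schedule described above: 0.01, then divided by 10 at epochs 20 and 35
for epoch in (0, 19, 20, 34, 35, 49):
    print(epoch, step_lr(epoch))

# Same milestones with a smaller starting learning rate
print(step_lr(0, base_lr=0.001))
```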




