深入浅出理解深度可分离卷积（Depthwise Separable Convolution）

一、参考资料

详细且通俗讲解轻量级神经网络——MobileNets【V1、V2、V3】
详细且通俗讲解轻量级神经网络——MobileNets【V1、V2、V3】
卷积神经网络中的Separable Convolution
深度学习中常用的几种卷积（下篇）：膨胀卷积、可分离卷积（深度可分离、空间可分离）、分组卷积（附Pytorch测试代码）

二、相关介绍

1. 标准卷积

标准卷积，利用若干个多通道卷积核对输入的多通道图像进行处理，输出的feature map既提取了通道特征，又提取了空间特征。

如下图所示，假设输入层为一个大小为64×64像素、3通道彩色图片。经过一个包含4个Filter的卷积层，最终输出4个Feature Map，且尺寸与输入层相同。此时，卷积层共4个Filter，每个Filter包含了3个Kernel，每个Kernel的大小为3×3。因此，卷积层的参数量为： $N_{std} = 4 × 3 × 3 × 3 = 108$ 。
在这里插入图片描述

用数学公式表达标准卷积，假设卷积核大小为 $D_K*D_K$ ，输入通道为M，输出通道为N，输出的特征图尺寸为 $D_F*D_F$ ，则通过标准卷积之后，可以计算：

参数量为： $D_K*D_K*M*N$ ；

计算量为： $D_K*D_K*M*N*D_F*D_F$ 。

2. 逐深度卷积（Depthwise Convolution）

逐深度卷积（Depthwise convolution，DWConv）与标准卷积的区别在于，深度卷积的卷积核为单通道模式，需要对输入的每一个通道进行卷积，这样就会得到和输入特征图通道数一致的输出特征图。即有输入特征图通道数=卷积核个数=输出特征图个数。

假设，一个大小为64×64像素、3通道彩色图片，3个单通道卷积核分别进行卷积计算，输出3个单通道的特征图。所以，一个3通道的图像经过运算后生成了3个Feature map，如下图所示。其中一个Filter只包含一个大小为3×3的Kernel，卷积部分的参数量为： $N_{depthwise} = 3 × 3 × 3 = 27$ 。
在这里插入图片描述

再比如，输入12x12x3的特征图，卷积个数为3，输出8x8x3的特征图。
在这里插入图片描述

3. 逐点卷积（Pointwise Convolution）

根据深度卷积可知，输入特征图通道数=卷积核个数=输出特征图个数，这样会导致输出的特征图个数过少（或者说输出特征图的通道数过少，可看成是输出特征图个数为1，通道数为3），从而可能影响信息的有效性。此时，就需要进行逐点卷积。

逐点卷积（Pointwise Convolution，PWConv）实质上是用1x1的卷积核进行升维。在GoogleNet中大量使用1x1的卷积核，那里主要是用来降维。1x1的卷积核主要作用是对特征图进行升维和降维。

如下图所示，从深度卷积得到的3个单通道特征图，经过4个大小为1x1x3卷积核的卷积计算，输出4个特征图，而输出特征图的个数取决于Filter的个数。因此，卷积层的参数量为： $N_{pointwise} = 1 × 1 × 3 × 4 = 12$ 。
在这里插入图片描述

再比如，当进行深度卷积之后，得到8x8x3的特征，此时用256个大小为1x1x3的卷积进行卷积计算，输出为8x8x256。因此，卷积层的参数量为： $N_{pointwise} = 1 × 1 × 3 × 256 = 768$ 。
在这里插入图片描述

4. 深度可分离卷积（Depthwise Separable Convolution）

深度可分离卷积（Depthwise separable convolution, DSC）由深度卷积和逐点卷积组成，深度卷积用于提取空间特征，逐点卷积用于提取通道特征。深度可分离卷积在特征维度上分组卷积，对每个channel进行独立的深度卷积（depthwise convolution），并在输出前使用一个1x1卷积（pointwise convolution）将所有通道进行聚合。

depthwise ：在空间上进行卷积

pointwise : 在深度上进行卷积

$\textcolor{red}{深度可分离卷积（Depthwise Separable Convolution）=depthwise(深度卷积)+pointwise(点卷积)}$
在这里插入图片描述

深度可分离卷积，先对每个channel进行DWConv，然后再通过 PWConv合并所有channels为输出特征图，从而达到减小计算量、提升计算效率的目的。

4.1 参数量

逐深度卷积：深度卷积的卷积核尺寸 $D_K * D_K * 1$ ，卷积核个数为M，所以参数量为： $D_K * D_K * M$ 。

逐点卷积：逐点卷积的卷积核尺寸为 $1 * 1 * M$ ，卷积核个数为N，所以参数量为： $M * N$ 。

因此，深度可分离卷积的参数量为： $D_K * D_K * M + M * N$ 。

4.2 计算量

深度卷积：深度卷积的卷积核尺寸 $D_K * D_K * 1$ ，卷积核个数为M，每个都要做 $D_F * D_F$ 次乘加运算，所以计算量为： $D_K * D_K * M * D_F * D_F$ 。

逐点卷积：逐点卷积的卷积核尺寸为 $1 * 1 * M$ ，卷积核个数为N，每个都要做 $D_F * D_F$ 次乘加运算，所以计算量为： $M*N * D_F * D_F$ 。

因此，深度可分离卷积的计算量为： $D_K * D_K * M*D_F * D_F + M*N*D_F * D_F$ 。

4.3 与标准卷积对比

4.3.1 结构对比

深度可分离卷积的每个块构成：首先是一个3x3的深度卷积，其次是BN、Relu层，接下来是1x1的逐点卷积，最后又是BN和Relu层。
在这里插入图片描述

4.3.2 计算量和参数量对比

参数量比值： $\frac{深度可分离卷积}{标准卷积} = \frac{D_K\times D_K\times M+M\times N}{D_K\times D_K\times M\times N} = \frac{1}{N}+\frac{1}{{D_{K}}^{2}}$ ；

计算量比值： $\frac{深度可分离卷积}{标准卷积} = \frac{D_{K}\times D_{K}\times M\times D_{F}\times D_{F}+M\times N\times D_{F}\times D_{F}}{D_{K}\times D_{K}\times M\times N\times D_{F}\times D_{F}} = \frac{1}{N}+\frac{1}{{D_{K}}^{2}}$ 。
在这里插入图片描述

一般的，N较大， $\frac{1}{N}$ 可忽略不计， $D_K$ 表示卷积核的大小，若 $D_K =3$ ， $\frac{1}{D_k^2}=\frac{1}{9}$ 。也就是说，如果我们使用常见的3×3的卷积核，那么使用深度可分离卷积的参数量和计算量下降到原来的九分之一左右。

4.4 深度可分离卷积的优势

相比于传统的卷积神经网络，深度可分离卷积的显著优势在于：

更少的参数：可减少输入通道数量，从而有效地减少卷积层所需的参数。
更快的速度：运行速度比传统卷积快。
更加易于移植：计算量更小，更易于实现和部署在不同的平台上。
更加精简：能够精简计算模型，从而在较小的设备上实现高精度的运算。

5. 代码示例

import torch.nn as nn

class myModel(nn.Module):
    def __init__(self):
        super(myModel, self).__init__()
        self.dwconv = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1, groups=3, bias=False),
            nn.BatchNorm2d(3),
            nn.ReLU(inplace=True),
            nn.Conv2d(3, 9, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(9),
            nn.ReLU(inplace=True),
        )

yolo系列里的代码如下：

import torch.nn as nn


class DWConv(nn.Module):
    """Depthwise Conv + Conv"""
    def __init__(self, in_channels, out_channels, ksize, stride=1, act="silu"):
        super().__init__()
        self.dconv = BaseConv(
            in_channels, in_channels, ksize=ksize,
            stride=stride, groups=in_channels, act=act
        )
        self.pconv = BaseConv(
            in_channels, out_channels, ksize=1,
            stride=1, groups=1, act=act
        )
 
    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)

三、MobileNet v1

论文：MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNet v1 和 MobileNet v2
MobileNet网络详解
【深度学习】轻量化CNN网络MobileNet系列详解
MobileNet V1 图像分类

MobileNet v1是专注于移动端或者嵌入式设备这种计算量不是特别大的轻量级CNN网络。如图下图所示，MobileNet v1只是牺牲了一点精度，却大大减少模型的参数量和运算量。
在这里插入图片描述

1. 网络结构

MobileNet v1最主要的贡献是使用了 Depthwise Separable Convolution，它又可以拆分成Depthwise卷积和Pointwise卷积。Depthwise Separable Convolution 的分离式设计直接将模型压缩了8倍左右，但是精度并没有损失非常严重，这一点还是非常震撼的。
在这里插入图片描述

轻量级网络的MobileNet v1计算量和参数量均小于GoogleNet，同时在分类效果上比GoogleNet还要好，这就是深度可分离卷积的功劳了。VGG16的计算量参数量比MobileNet大了30倍，但是结果也仅仅只高了1%不到。
在这里插入图片描述

2. 优势

首先，MobileNet v1提出的深度可分离卷积可以大大减少计算量和参数量；其次，就是增加超参数α、ρ可以根据需求调节网络的宽度和分辨率。

具体来说，超参数α是为了控制卷积核的个数，也就是输出的channel，因此α可以减少模型的参数量；超参数ρ是为了控制图像输入的size，是不会影响模型的参数，但是可以减少计算量。

网络的宽度，代表卷积层的维度，也就是channel，例如512，1024。

网络的深度，代表卷积层的层数，也就是网络有多深，例如resnet34、resnet101。

3. 代码实现

3.1 搭建MobileNet v1网络模型

import torch.nn as nn
 
 
# MobileNet v1
class MobileNetV1(nn.Module):
    def __init__(self,num_classes=1000):
        super(MobileNetV1, self).__init__()
 
        # 第一层的卷积,channel->32,size减半
        def conv_bn(in_channel, out_channel, stride):
            return nn.Sequential(
                nn.Conv2d(in_channel, out_channel, 3, stride, 1, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True)
            )
 
        # 深度可分离卷积=depthwise卷积 + pointwise卷积
        def conv_dw(in_channel, out_channel, stride):
            return nn.Sequential(
                # depthwise 卷积,channel不变，stride = 2的时候，size减半
                nn.Conv2d(in_channel, in_channel, 3, stride, padding=1, groups=in_channel, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.ReLU(inplace=True),
 
                # pointwise卷积(1*1卷积) same卷积, 只改变channel
                nn.Conv2d(in_channel, out_channel, 1, 1, padding=0, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True),
            )
 
        self.model = nn.Sequential(
            conv_bn(3, 32, 2),          # conv/s2           out=224*224*32
            conv_dw(32, 64, 1),         # conv dw +1*1      out=112*112*64
            conv_dw(64, 128, 2),        # conv dw +1*1      out=56*56*128
            conv_dw(128, 128, 1),       # conv dw +1*1      out=56*56*128
            conv_dw(128, 256, 2),       # conv dw +1*1      out=28*28*256
            conv_dw(256, 256, 1),       # conv dw +1*1      out=28*28*256
            conv_dw(256, 512, 2),       # conv dw +1*1      out=14*14*512
            conv_dw(512, 512, 1),       # 5个 conv dw +1*1 ----> size不变，channel不变，out=14*14*512
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 1024, 2),      # conv dw +1*1      out=7*7*1024
            conv_dw(1024, 1024, 1),     # conv dw +1*1      out=7*7*1024
            nn.AvgPool2d(7),            # avg pool          out=1*1*1024
        )
        self.fc = nn.Linear(1024, num_classes)      # fc
 
    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, 1024)
        x = self.fc(x)
        return x

3.2 `torchsummary`查看网络结构

# 安装torchsummary
pip install torchsummary

使用 torchsummary 查看网络结构：

from torchsummary import summary
import torch
 
 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
net = MobileNetV1()
net.to(DEVICE)
print(summary(net, input_size=(3, 224, 224),device=DEVICE))

输出结果：

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 32, 112, 112]             864
       BatchNorm2d-2         [-1, 32, 112, 112]              64
              ReLU-3         [-1, 32, 112, 112]               0
            Conv2d-4         [-1, 32, 112, 112]             288
       BatchNorm2d-5         [-1, 32, 112, 112]              64
              ReLU-6         [-1, 32, 112, 112]               0
            Conv2d-7         [-1, 64, 112, 112]           2,048
       BatchNorm2d-8         [-1, 64, 112, 112]             128
              ReLU-9         [-1, 64, 112, 112]               0
           Conv2d-10           [-1, 64, 56, 56]             576
      BatchNorm2d-11           [-1, 64, 56, 56]             128
             ReLU-12           [-1, 64, 56, 56]               0
           Conv2d-13          [-1, 128, 56, 56]           8,192
      BatchNorm2d-14          [-1, 128, 56, 56]             256
             ReLU-15          [-1, 128, 56, 56]               0
           Conv2d-16          [-1, 128, 56, 56]           1,152
      BatchNorm2d-17          [-1, 128, 56, 56]             256
             ReLU-18          [-1, 128, 56, 56]               0
           Conv2d-19          [-1, 128, 56, 56]          16,384
      BatchNorm2d-20          [-1, 128, 56, 56]             256
             ReLU-21          [-1, 128, 56, 56]               0
           Conv2d-22          [-1, 128, 28, 28]           1,152
      BatchNorm2d-23          [-1, 128, 28, 28]             256
             ReLU-24          [-1, 128, 28, 28]               0
           Conv2d-25          [-1, 256, 28, 28]          32,768
      BatchNorm2d-26          [-1, 256, 28, 28]             512
             ReLU-27          [-1, 256, 28, 28]               0
           Conv2d-28          [-1, 256, 28, 28]           2,304
      BatchNorm2d-29          [-1, 256, 28, 28]             512
             ReLU-30          [-1, 256, 28, 28]               0
           Conv2d-31          [-1, 256, 28, 28]          65,536
      BatchNorm2d-32          [-1, 256, 28, 28]             512
             ReLU-33          [-1, 256, 28, 28]               0
           Conv2d-34          [-1, 256, 14, 14]           2,304
      BatchNorm2d-35          [-1, 256, 14, 14]             512
             ReLU-36          [-1, 256, 14, 14]               0
           Conv2d-37          [-1, 512, 14, 14]         131,072
      BatchNorm2d-38          [-1, 512, 14, 14]           1,024
             ReLU-39          [-1, 512, 14, 14]               0
           Conv2d-40          [-1, 512, 14, 14]           4,608
      BatchNorm2d-41          [-1, 512, 14, 14]           1,024
             ReLU-42          [-1, 512, 14, 14]               0
           Conv2d-43          [-1, 512, 14, 14]         262,144
      BatchNorm2d-44          [-1, 512, 14, 14]           1,024
             ReLU-45          [-1, 512, 14, 14]               0
           Conv2d-46          [-1, 512, 14, 14]           4,608
      BatchNorm2d-47          [-1, 512, 14, 14]           1,024
             ReLU-48          [-1, 512, 14, 14]               0
           Conv2d-49          [-1, 512, 14, 14]         262,144
      BatchNorm2d-50          [-1, 512, 14, 14]           1,024
             ReLU-51          [-1, 512, 14, 14]               0
           Conv2d-52          [-1, 512, 14, 14]           4,608
      BatchNorm2d-53          [-1, 512, 14, 14]           1,024
             ReLU-54          [-1, 512, 14, 14]               0
           Conv2d-55          [-1, 512, 14, 14]         262,144
      BatchNorm2d-56          [-1, 512, 14, 14]           1,024
             ReLU-57          [-1, 512, 14, 14]               0
           Conv2d-58          [-1, 512, 14, 14]           4,608
      BatchNorm2d-59          [-1, 512, 14, 14]           1,024
             ReLU-60          [-1, 512, 14, 14]               0
           Conv2d-61          [-1, 512, 14, 14]         262,144
      BatchNorm2d-62          [-1, 512, 14, 14]           1,024
             ReLU-63          [-1, 512, 14, 14]               0
           Conv2d-64          [-1, 512, 14, 14]           4,608
      BatchNorm2d-65          [-1, 512, 14, 14]           1,024
             ReLU-66          [-1, 512, 14, 14]               0
           Conv2d-67          [-1, 512, 14, 14]         262,144
      BatchNorm2d-68          [-1, 512, 14, 14]           1,024
             ReLU-69          [-1, 512, 14, 14]               0
           Conv2d-70            [-1, 512, 7, 7]           4,608
      BatchNorm2d-71            [-1, 512, 7, 7]           1,024
             ReLU-72            [-1, 512, 7, 7]               0
           Conv2d-73           [-1, 1024, 7, 7]         524,288
      BatchNorm2d-74           [-1, 1024, 7, 7]           2,048
             ReLU-75           [-1, 1024, 7, 7]               0
           Conv2d-76           [-1, 1024, 7, 7]           9,216
      BatchNorm2d-77           [-1, 1024, 7, 7]           2,048
             ReLU-78           [-1, 1024, 7, 7]               0
           Conv2d-79           [-1, 1024, 7, 7]       1,048,576
      BatchNorm2d-80           [-1, 1024, 7, 7]           2,048
             ReLU-81           [-1, 1024, 7, 7]               0
        AvgPool2d-82           [-1, 1024, 1, 1]               0
           Linear-83                 [-1, 1000]       1,025,000
================================================================
Total params: 4,231,976
Trainable params: 4,231,976
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 115.43
Params size (MB): 16.14
Estimated Total Size (MB): 132.15
----------------------------------------------------------------
None

3.3 train训练模型

import torch
import torch.nn as nn
from torchvision import transforms, datasets
import torch.optim as optim
from model import MobileNetV1
from torch.utils.data import DataLoader
from tqdm import tqdm
 
 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
data_transform = {
    "train" : transforms.Compose([transforms.Resize((224,224)),
                                  transforms.ToTensor(),
                                  transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.255])]),
    "test": transforms.Compose([transforms.Resize((224,224)),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])])}
 
# 训练集
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=data_transform['train'])
trainloader = DataLoader(trainset, batch_size=16, shuffle=True)
 
# 测试集
testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=data_transform['test'])
testloader = DataLoader(testset, batch_size=16, shuffle=False)
 
# 样本的个数
num_trainset = len(trainset)  # 50000
num_testset = len(testset)    # 10000
 
# 构建网络
net =MobileNetV1(num_classes=10)
net.to(DEVICE)
 
# 加载损失和优化器
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0001)
 
best_acc = 0.0
save_path = './MobileNetV1.pth'
 
for epoch in range(10):
    net.train()     # 训练模式
    running_loss = 0.0
    for data in tqdm(trainloader):
        images, labels = data
        images, labels = images.to(DEVICE), labels.to(DEVICE)
 
        optimizer.zero_grad()
        out = net(images)  # 总共有三个输出
        loss = loss_function(out,labels)
        loss.backward()  # 反向传播
        optimizer.step()
 
        running_loss += loss.item()
 
    # test
    net.eval()      # 测试模式
    acc = 0.0
    with torch.no_grad():
        for test_data in tqdm(testloader):
            test_images, test_labels = test_data
            test_images, test_labels = test_images.to(DEVICE), test_labels.to(DEVICE)
 
            outputs = net(test_images)
            predict_y = torch.max(outputs, dim=1)[1]
            acc += (predict_y == test_labels).sum().item()
 
    accurate = acc / num_testset
    train_loss = running_loss / num_trainset
 
    print('[epoch %d] train_loss: %.3f  test_accuracy: %.3f' %
          (epoch + 1, train_loss, accurate))
 
    if accurate > best_acc:
        best_acc = accurate
        torch.save(net.state_dict(), save_path)
 
print('Finished Training')

输出结果：

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170499072it [00:30, 5634555.25it/s]                                
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
100%|██████████| 3125/3125 [02:18<00:00, 22.55it/s]
100%|██████████| 625/625 [00:12<00:00, 51.10it/s]
[epoch 1] train_loss: 0.101  test_accuracy: 0.516
100%|██████████| 3125/3125 [02:23<00:00, 21.78it/s]
100%|██████████| 625/625 [00:11<00:00, 54.31it/s]
[epoch 2] train_loss: 0.079  test_accuracy: 0.612
100%|██████████| 3125/3125 [02:20<00:00, 22.17it/s]
100%|██████████| 625/625 [00:11<00:00, 54.28it/s]
[epoch 3] train_loss: 0.066  test_accuracy: 0.672
100%|██████████| 3125/3125 [02:21<00:00, 22.09it/s]
100%|██████████| 625/625 [00:11<00:00, 55.52it/s]
[epoch 4] train_loss: 0.056  test_accuracy: 0.722
100%|██████████| 3125/3125 [02:13<00:00, 23.34it/s]
100%|██████████| 625/625 [00:11<00:00, 55.56it/s]
[epoch 5] train_loss: 0.048  test_accuracy: 0.748
100%|██████████| 3125/3125 [02:14<00:00, 23.31it/s]
100%|██████████| 625/625 [00:11<00:00, 52.19it/s]
[epoch 6] train_loss: 0.042  test_accuracy: 0.763
100%|██████████| 3125/3125 [02:14<00:00, 23.18it/s]
100%|██████████| 625/625 [00:11<00:00, 56.05it/s]
[epoch 7] train_loss: 0.035  test_accuracy: 0.781
100%|██████████| 3125/3125 [02:14<00:00, 23.27it/s]
100%|██████████| 625/625 [00:11<00:00, 55.88it/s]
[epoch 8] train_loss: 0.031  test_accuracy: 0.790
100%|██████████| 3125/3125 [02:13<00:00, 23.32it/s]
100%|██████████| 625/625 [00:11<00:00, 55.89it/s]
[epoch 9] train_loss: 0.026  test_accuracy: 0.801
100%|██████████| 3125/3125 [02:15<00:00, 22.99it/s]
100%|██████████| 625/625 [00:11<00:00, 55.95it/s]
[epoch 10] train_loss: 0.022  test_accuracy: 0.803
Finished Training

Process finished with exit code 0

显卡资源占用情况：
在这里插入图片描述

3.4 查看模型权重参数

from model import MobileNetV1
import torch
 
 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
 
net = MobileNetV1(num_classes=10)
net.load_state_dict(torch.load('./MobileNetV1.pth'))
net.to(DEVICE)
 
 
with torch.no_grad():
    for i in range(0,14):       # 查看 depthwise 的权值
        print(net.model[i][0].weight)

3.5 在CIFAR10数据集上测试效果

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
 
import torch
import numpy as np
import matplotlib.pyplot as plt
from model import MobileNetV1
from torchvision.transforms import transforms
from torch.utils.data import DataLoader
import torchvision
 
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
 
# 预处理
transformer = transforms.Compose([transforms.Resize((224,224)),
                                  transforms.ToTensor(),
                                  transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])])
 
# 加载模型
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MobileNetV1(num_classes=10)
model.load_state_dict(torch.load('./MobileNetV1.pth'))
model.to(DEVICE)
 
# 加载数据
testSet = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=transformer)
testLoader = DataLoader(testSet, batch_size=12, shuffle=True)
 
# 获取一批数据
imgs, labels = next(iter(testLoader))
imgs = imgs.to(DEVICE)
 
# show
with torch.no_grad():
    model.eval()
    prediction = model(imgs)  # 预测
    prediction = torch.max(prediction, dim=1)[1]
    prediction = prediction.data.cpu().numpy()
 
    plt.figure(figsize=(12, 8))
    for i, (img, label) in enumerate(zip(imgs, labels)):
        x = np.transpose(img.data.cpu().numpy(), (1, 2, 0))  # 图像
        x[:, :, 0] = x[:, :, 0] * 0.229 + 0.485  # 去 normalization
        x[:, :, 1] = x[:, :, 1] * 0.224 + 0.456  # 去 normalization
        x[:, :, 2] = x[:, :, 2] * 0.255 + 0.406  # 去 normalization
        y = label.numpy().item()  # label
        plt.subplot(3, 4, i + 1)
        plt.axis(False)
        plt.imshow(x)
        plt.title('R:{},P:{}'.format(classes[y], classes[prediction[i]]))
    plt.show()

结果展示：

在这里插入图片描述

四、MobileNet v2

论文：MobileNetV2: Inverted Residuals and Linear Bottlenecks

0. 引言

特征图的每个通道的像素值所代表的特征可以映射到一个低维子空间的流形区域上。通常，在进行卷积操作之后往往会接一层激活层，以此增加特征的非线性，一个常见的激活函数就是 ReLU。激活过程会带来信息损耗，而且这种损耗是无法恢复的，当通道数非常少时，ReLU的信息损耗更为明显。

如下图所示，其输入是一个表示流形数据的矩阵，和卷积操作类似，经过 n 个ReLU的操作得到 n 个通道的Feature Map，然后通过 n 个Feature Map还原输入数据，还原的越像说明信息损耗的越少。
在这里插入图片描述

从上图可以看出，在输入维度是2，3时，输出和输入相比丢失了较多信息；但是在输入维度是15到30时，输出则保留了输入的较多信息。总得来说，当n值较小时，ReLU的信息损耗非常严重，当n值较大时，输入流形能较好还原。

根据对上面提到的信息损耗问题分析，我们可以有两种解决方案：

替换ReLU：既然是ReLU导致的信息损耗，那么可以将ReLU替换成线性激活函数；
提高维度：如果比较多的通道数能减少信息损耗，那么可以通过升维将输入的维度变高。

MobileNet v2的题目为 MobileNetV2: Inverted Residuals and Linear Bottlenecks ， Linear Bottlenecks 和 Inverted Residuals 就是MobileNet v2的核心，也是上述所说两种思路的描述。

1. `Linear Bottlenecks`

将Relu激活函数替换成线性激活函数，文章中将变换后的块称为 Linear Bottlenecks，结构如下图所示：
在这里插入图片描述

当然不能把ReLU全部换成线性激活函数，不然网络将会退化为单层神经网络，一个折中方案是在输出Feature Map的通道数较少的时候，也就是bottleneck部分使用线性激活函数，其它时候使用ReLU。Linear Bottlenecks 块的代码实现如下：

def _bottleneck(inputs, nb_filters, t):
    x = Conv2D(filters=nb_filters * t, kernel_size=(1,1), padding='same')(inputs)
    x = Activation(relu6)(x)
    x = DepthwiseConv2D(kernel_size=(3,3), padding='same')(x)
    x = Activation(relu6)(x)
    x = Conv2D(filters=nb_filters, kernel_size=(1,1), padding='same')(x)
    # do not use activation function
    if not K.get_variable_shape(inputs)[3] == nb_filters:
        inputs = Conv2D(filters=nb_filters, kernel_size=(1,1), padding='same')(inputs)
    outputs = add([x, inputs])
    return outputs

2. `Inverted Residual`

Inverted Residuals直译为倒残差结构，我们来看看其与正常的残差结构有什么区别和联系：通过下图可以看出，左侧为ResNet中的残差结构，其结构为：1x1卷积降维->3x3卷积->1x1卷积升维；右侧为MobileNet v2中的倒残差结构，其结构为：1x1卷积升维->3x3DW卷积->1x1卷积降维。MobileNet v2先使用1x1进行升维的原因是：高维信息通过ReLU激活函数后丢失的信息更少，因此先进行升维操作。
在这里插入图片描述

这部分需要注意的是只有当s=1,即步长为1时，才有shortcut连接，步长为2是没有的，如下图所示。

在这里插入图片描述

3. 网络结构

在这里插入图片描述

MobileNet v2所用的参数更少，但mAP值和其它的差不多，甚至超过了Yolov2，其效果如下图所示：
在这里插入图片描述

4. 代码实现

MobileNet v2的实现可以通过堆叠bottleneck的形式实现，如下面代码片段：

def MobileNetV2_relu(input_shape, k):
    inputs = Input(shape = input_shape)
    x = Conv2D(filters=32, kernel_size=(3,3), padding='same')(inputs)
    x = _bottleneck_relu(x, 8, 6)
    x = MaxPooling2D((2,2))(x)
    x = _bottleneck_relu(x, 16, 6)
    x = _bottleneck_relu(x, 16, 6)
    x = MaxPooling2D((2,2))(x)
    x = _bottleneck_relu(x, 32, 6)
    x = GlobalAveragePooling2D()(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(k, activation='softmax')(x)
    model = Model(inputs, outputs)
    return model