经典轻量级神经网络(3)ShuffleNet V1及其在Fashion-MNIST数据集上的应用

1 ShuffleNet V1的简述

ShuffleNet 提出了 1x1分组卷积+通道混洗 的策略，在保证准确率的同时大幅降低计算成本。

ShuffleNet 专为计算能力有限的设备（如：10~150MFLOPs）设计。在基于ARM 的移动设备上，ShuffleNet 与AlexNet 相比，在保持相当的准确率的同时，大约 13 倍的加速

论文地址: https://arxiv.org/pdf/1707.01083.pdf

1.1 ShuffleNet的创新点

1.1.1 背景介绍

建立更深、更大的卷积神经网络CNN是解决视觉识别任务的主要趋势。但通常具有数百个层和数千个通道，因此需要极大算力消耗。
无人机，机器人和智能手机等常见的移动平台需要在非常有限的算力支撑下尽可能的提升准确率。

因此，旷视科技针对计算能力非常有限的移动设备，设计引入计算效率极高的CNN架构ShuffleNet。通过分组点卷积和通道重排两点创新，在保持精度的同时大大降低了计算成本。

1.1.2 分组点卷积(Pointwise group convolution)

在Xception 和ResNeXt 中，有大量的1x1 卷积，所以整体而言1x1 卷积的计算开销较大。如ResNeXt 的每个残差块中，1x1 卷积占据了乘-加运算的 93.4% （基数为32时）。

在小型网络中，为了满足计算性能的约束（因为计算资源不够）需要控制计算量。虽然限制通道数量可以降低计算量，但这也可能会严重降低准确率。

解决办法是：对1x1 卷积应用分组卷积，将每个 1x1 卷积仅仅在相应的通道分组上操作，这样就可以降低每个1x1 卷积的计算代价。

在这里插入图片描述

论文的分组点卷积也就是说又点卷积又分组。也就是之前的分组卷积分析中，将卷积核的大小变为1×1的卷积。但是这里的点卷积就不是贯通输入全通道了，而是贯通分组后的每一组的全通道。

1.1.3 通道重排(Channel shuffle)

1x1 卷积仅在相应的通道分组上操作会带来一个副作用：每个通道的输出仅仅与该通道所在分组的输入（一般占总输入的比例较小）有关，与其它分组的输入（一般占总输入的比例较大）无关。这会阻止通道之间的信息流动，降低网络的表达能力。

解决办法是：采用通道混洗，允许分组卷积从不同分组中获取输入。

如下图所示：(a) 表示没有通道混洗的分组卷积；(b) 表示进行通道混洗的分组卷积；(c) 为(b) 的等效表示。
由于通道混洗是可微的，因此它可以嵌入到网络中以进行端到端的训练。

在这里插入图片描述

1.1.4 `ResNeXt` 块和`ShuffleNet` 块

ShuffleNet 块的结构从ResNeXt 块改进而来：下图中(a) 是一个ResNeXt 块，(b) 是一个 ShuffleNet 块，(c) 是一个步长为2 的 ShuffleNet 块。

在 ShuffleNet 块中：

第一个1x1 卷积替换为1x1 分组卷积+通道随机混洗。
第二个1x1 卷积替换为1x1 分组卷积，但是并没有附加通道随机混洗。这是为了简单起见，因为不附加通道随机混洗已经有了很好的结果。
在3x3 depthwise 卷积之后只有BN 而没有ReLU 。
当步长为2时：
- 恒等映射直连替换为一个尺寸为 3x3 、步长为2 的平均池化。
- 3x3 depthwise 卷积的步长为2 。
- 将残差部分与直连部分的feature map 拼接，而不是相加。
  
  因为当feature map 减半时，为了缓解信息丢失需要将输出通道数加倍从而保持模型的有效容量。

在这里插入图片描述

1.2 ShuffleNet网络性能

1.2.1 ShuffleNet网络结构

ShuffleNet 网络由ShuffleNet 块组成。

网络主要由三个Stage 组成。
- 每个Stage 的第一个块的步长为 2 ，stage 内的块在其它参数上相同。
- 每个Stage 的输出通道数翻倍。
在 Stage2 的第一个1x1 卷积并不执行分组卷积，因此此时的输入通道数相对较小。
每个ShuffleNet 块中的第一个1x1 分组卷积的输出通道数为：该块的输出通道数的 1/4 。
使用较少的数据集增强，因为这一类小模型更多的是遇到欠拟合而不是过拟合。
复杂度给出了计算量（乘-加运算），KSize 给出了卷积核的尺寸，Stride 给出了ShuffleNet block 的步长，Repeat 给出了 ShuffleNet block 重复的次数，g 控制了ShuffleNet block 分组的数量。

g=1 时，1x1 的通道分组卷积退化回原始的1x1 卷积。

在这里插入图片描述

1.2.2 超参数g的影响

超参数 g 会影响模型的准确率和计算量。在 ImageNet 测试集上的表现如下：

ShuffleNet sx 表示对ShuffleNet 的通道数增加到 s 倍。这通过控制 Conv1 卷积的输出通道数来实现。
g 越大，则计算量越小，模型越准确。

其背后的原理是：小模型的表达能力不足，通道数量越大，则小模型的表达能力越强。
- g 越大，则准确率越高。这是因为对于ShuffleNet，分组越大则生成的feature map 通道数量越多，模型表达能力越强。
- 网络的通道数越小（如ShuffleNet 0.25x ），则这种增益越大。
随着分组越来越大，准确率饱和甚至下降。

这是因为随着分组数量增大，每个组内的通道数量变少。虽然总体通道数增加，但是每个分组提取有效信息的能力变弱，降低了模型整体的表征能力。
虽然较大g 的ShuffleNet 通常具有更好的准确率。但是由于它的实现效率较低，会带来较大的推断时间。

在这里插入图片描述

1.2.3 混洗的效果

通道随机混洗的效果要比不混洗好。在 ImageNet 测试集上的表现如下：

通道混洗使得分组卷积中，信息能够跨分组流动。
分组数g 越大，这种混洗带来的好处越大。

在这里插入图片描述

1.2.4 在移动设备上的推断时间

在移动设备上的推断时间（Qualcomm Snapdragon 820 processor，单线程）：

在骁龙820的处理器上，shuffleNet 0.5x 的错误率与 AlexNet 相当，但是推理速度为原来的十倍，通过对比能发现它有多么轻量级。suffleNet 2x 在推理速度和 MobileNet V1 相当的情况下，准确率要更好。

在这里插入图片描述

1.3 ShuffleNet代码实现

1.3.1 Shuffle的实现

shufflenet中代码比较有趣的点就在于shuffle的实现，实际上是利用了张量维度的变化来实现这一功能。

具体来说，对于下面这个图来讲，现在是channels分成了三组，每个组里有四个小球，于是就通过先reshape再转置再展开的方式来实现最后的目的。

在这里插入图片描述

对于通道数为c的输入，先将其拆分为g组，每组有n各通道。也就是说c=g×n。

进行拆分 reshape (c=g * n) -> (g, n)
进行转置 transpose (g, n) -> (n, g)
将其展开 flatten

具体的shuffle步骤的代码：

def channel_shuffle(x, groups):
    batchsize, num_channels, height, width = x.data.size()

    channels_per_group = num_channels // groups  # groups是分的组数

    # reshape
    x = x.view(batchsize, groups,channels_per_group, height, width)

    # transpose
    # - contiguous() required if transpose() is used before view().
    x = torch.transpose(x, 1, 2).contiguous()

    # flatten
    x = x.view(batchsize, -1, height, width)

    return x

1.3.2 ShuffleNet 块

在这里插入图片描述

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from collections import OrderedDict
from torch.nn import init


def conv3x3(in_channels, out_channels, stride=1,padding=1, bias=True, groups=1):
    """3x3 convolution with padding
    """
    return nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=3,
        stride=stride,
        padding=padding,
        bias=bias,
        groups=groups)


def conv1x1(in_channels, out_channels, groups=1):
    """1x1 convolution with padding
    - Normal pointwise convolution When groups == 1
    - Grouped pointwise convolution when groups > 1
    """
    return nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=1,
        groups=groups,
        stride=1)


class ShuffleUnit(nn.Module):
    def __init__(self, in_channels, out_channels, groups=3,
                 grouped_conv=True, combine='add'):

        super(ShuffleUnit, self).__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.grouped_conv = grouped_conv
        self.combine = combine
        self.groups = groups
        self.bottleneck_channels = self.out_channels // 4

        # define the type of ShuffleUnit
        if self.combine == 'add':
            # ShuffleUnit Figure 2b
            self.depthwise_stride = 1
            self._combine_func = self._add
        elif self.combine == 'concat':
            # ShuffleUnit Figure 2c
            self.depthwise_stride = 2
            self._combine_func = self._concat

            # ensure output of concat has the same channels as
            # original output channels.
            self.out_channels -= self.in_channels
        else:
            raise ValueError("Cannot combine tensors with \"{}\"" \
                             "Only \"add\" and \"concat\" are" \
                             "supported".format(self.combine))

        # Use a 1x1 grouped or non-grouped convolution to reduce input channels
        # to bottleneck channels, as in a ResNet bottleneck module.
        # NOTE: Do not use group convolution for the first conv1x1 in Stage 2.
        self.first_1x1_groups = self.groups if grouped_conv else 1

        self.g_conv_1x1_compress = self._make_grouped_conv1x1(
            self.in_channels,
            self.bottleneck_channels,
            self.first_1x1_groups,
            batch_norm=True,
            relu=True
        )

        # 3x3 depthwise convolution followed by batch normalization
        self.depthwise_conv3x3 = conv3x3(
            self.bottleneck_channels, self.bottleneck_channels,
            stride=self.depthwise_stride, groups=self.bottleneck_channels)
        self.bn_after_depthwise = nn.BatchNorm2d(self.bottleneck_channels)

        # Use 1x1 grouped convolution to expand from
        # bottleneck_channels to out_channels
        self.g_conv_1x1_expand = self._make_grouped_conv1x1(
            self.bottleneck_channels,
            self.out_channels,
            self.groups,
            batch_norm=True,
            relu=False
        )

    @staticmethod
    def _add(x, out):
        # residual connection
        return x + out

    @staticmethod
    def _concat(x, out):
        # concatenate along channel axis
        return torch.cat((x, out), 1)

    def _make_grouped_conv1x1(self, in_channels, out_channels, groups,
                              batch_norm=True, relu=False):

        modules = OrderedDict()

        conv = conv1x1(in_channels, out_channels, groups=groups)
        modules['conv1x1'] = conv

        if batch_norm:
            modules['batch_norm'] = nn.BatchNorm2d(out_channels)
        if relu:
            modules['relu'] = nn.ReLU()
        if len(modules) > 1:
            return nn.Sequential(modules)
        else:
            return conv

    def forward(self, x):
        # save for combining later with output
        residual = x

        if self.combine == 'concat':
            residual = F.avg_pool2d(residual, kernel_size=3,
                                    stride=2, padding=1)

        out = self.g_conv_1x1_compress(x)
        out = channel_shuffle(out, self.groups)
        out = self.depthwise_conv3x3(out)
        out = self.bn_after_depthwise(out)
        out = self.g_conv_1x1_expand(out)

        out = self._combine_func(residual, out)
        return F.relu(out)

1.3.3 ShuffleNet实现

class ShuffleNetV1(nn.Module):
    """ShuffleNet implementation.
    """

    def __init__(self, groups = 3, in_channels=3, num_classes=1000):
        """ShuffleNet constructor.

        Arguments:
            groups (int, optional): number of groups to be used in grouped
                1x1 convolutions in each ShuffleUnit. Default is 3 for best
                performance according to original paper.
            in_channels (int, optional): number of channels in the input tensor.
                Default is 3 for RGB image inputs.
            num_classes (int, optional): number of classes to predict. Default
                is 1000 for ImageNet.

        """
        super(ShuffleNetV1, self).__init__()

        self.groups = groups
        self.stage_repeats = [3, 7, 3]
        self.in_channels = in_channels
        self.num_classes = num_classes

        # index 0 is invalid and should never be called.
        # only used for indexing convenience.
        if groups == 1:
            self.stage_out_channels = [-1, 24, 144, 288, 567]
        elif groups == 2:
            self.stage_out_channels = [-1, 24, 200, 400, 800]
        elif groups == 3:
            self.stage_out_channels = [-1, 24, 240, 480, 960]
        elif groups == 4:
            self.stage_out_channels = [-1, 24, 272, 544, 1088]
        elif groups == 8:
            self.stage_out_channels = [-1, 24, 384, 768, 1536]
        else:
            raise ValueError(
                """{} groups is not supported for 1x1 Grouped Convolutions""".format(groups))

        # Stage 1 always has 24 output channels
        self.conv1 = conv3x3(self.in_channels,
                             self.stage_out_channels[1],  # stage 1
                             stride=2)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Stage 2
        self.stage2 = self._make_stage(2)
        # Stage 3
        self.stage3 = self._make_stage(3)
        # Stage 4
        self.stage4 = self._make_stage(4)

        # Global pooling:
        # Undefined as PyTorch's functional API can be used for on-the-fly
        # shape inference if input size is not ImageNet's 224x224

        # Fully-connected classification layer
        num_inputs = self.stage_out_channels[-1]
        self.fc = nn.Linear(num_inputs, self.num_classes)
        self.init_params()

    def init_params(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant(m.weight, 1)
                init.constant(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant(m.bias, 0)

    def _make_stage(self, stage):
        modules = OrderedDict()
        stage_name = "ShuffleUnit_Stage{}".format(stage)

        # First ShuffleUnit in the stage
        # 1. non-grouped 1x1 convolution (i.e. pointwise convolution)
        #   is used in Stage 2. Group convolutions used everywhere else.
        grouped_conv = stage > 2

        # 2. concatenation unit is always used.
        first_module = ShuffleUnit(
            self.stage_out_channels[stage - 1],
            self.stage_out_channels[stage],
            groups=self.groups,
            grouped_conv=grouped_conv,
            combine='concat'
        )
        modules[stage_name + "_0"] = first_module

        # add more ShuffleUnits depending on pre-defined number of repeats
        for i in range(self.stage_repeats[stage - 2]):
            name = stage_name + "_{}".format(i + 1)
            module = ShuffleUnit(
                self.stage_out_channels[stage],
                self.stage_out_channels[stage],
                groups=self.groups,
                grouped_conv=True,
                combine='add'
            )
            modules[name] = module

        return nn.Sequential(modules)

    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(x)

        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)

        # global average pooling layer
        x = F.avg_pool2d(x, x.data.size()[-2:])

        # flatten for input to fully-connected layer
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x
    

if __name__ == '__main__':
    net = ShuffleNetV1(in_channels=1, num_classes=10)
    X = torch.rand(size=(1, 1, 224, 224), dtype=torch.float32)
    print(net(X).shape)

2 ShuffleNet V1在Fashion-MNIST数据集上的应用示例

2.1 创建ShuffleNet V1网络模型

ShuffleNet V1网络，如1.3所示。

2.2 读取Fashion-MNIST数据集

其他所有的函数，与经典神经网络(1)LeNet及其在Fashion-MNIST数据集上的应用完全一致。

# 我们将图片大小设置224×224
# 训练机器内存有限，将批量大小设置为64
batch_size = 64

train_iter,test_iter = get_mnist_data(batch_size,resize=224)

2.3 在GPU上进行模型训练

from _10_ShuffleNetV1 import ShuffleNetV1

# 初始化模型,并设置为10分类
net = ShuffleNetV1(in_channels=1, num_classes=10)

lr, num_epochs = 0.1, 10
train_ch(net, train_iter, test_iter, num_epochs, lr, try_gpu())