EfficientNet Architecture Explained in Detail + SE Attention Mechanism + PyTorch Reimplementation


Contents

🚀🚀🚀 Preface
I. 1️⃣ Detailed network architecture
- 1.1 🎓 The MBConv block
- 1.2 ✨ The SE attention module
- 1.3 ⭐️ Depthwise separable convolution
  - 1.3.1 Standard convolution
  - 1.3.2 Depthwise convolution
  - 1.3.3 Pointwise convolution
  - 🔥 1.3.4 Depthwise separable convolution
  - 1.3.5 Computing the parameter count of a depthwise separable convolution
- 1.4 🎯 Parameter settings for B0~B7
II. 2️⃣ Building EfficientNet in PyTorch


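👀🎉📜 Other posts in this series

[EfficientNetV1: Rethinking Model Scaling for Convolutional Neural Networks]

🚀🚀🚀 Preface

The previous post walked through the work and experiments in the EfficientNet paper. The paper's goal is to balance network depth, width, and input resolution so that the feature extractor performs as well as possible. The paper itself, however, does not describe in detail how the network is built. This post dissects the EfficientNet architecture, covering the SE attention mechanism, depthwise separable convolution, and the MBConv block, and then reimplements the network in PyTorch.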

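I. 1️⃣ Detailed network architecture

The architecture table for EfficientNet-B0 in the paper (B1~B7 differ only in their resolution, channels, and layers settings) shows that the network consists of 9 stages.

Stage 1: an ordinary convolution layer with a 3x3 kernel and stride=2, followed by a BN layer and the Swish activation.
Stages 2~8: repeated MBConv blocks (the last column, Layers, gives how many times the MBConv block is repeated in that stage).
Stage 9: an ordinary 1x1 convolution layer (with BN and Swish), an average pooling layer, and a fully connected layer.

In the table, each MBConv is followed by the number 1 or 6. This is the expansion factor n: the first 1x1 convolution in the MBConv block expands the channels of the input feature map by a factor of n. k3x3 or k5x5 is the kernel size used by the depthwise convolution inside MBConv. Channels is the number of channels of the feature map output by that stage. Resolution is the size (height and width) of the feature map fed into that stage.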
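To make the table columns concrete, here is a small sketch that walks through the nine B0 stages and tracks the feature map size. The kernel sizes, strides, channels, and repeat counts are copied from the `default_cnf` list in the implementation in Part II. The names `B0_STAGES` and `trace_resolutions` are made up for this illustration, and the size rule assumes the `(kernel_size - 1) // 2` padding used by that implementation.

```python
# B0 stage settings: (operator, kernel size, stride, output channels, layers).
# Hypothetical helper table for illustration; values mirror default_cnf in Part II.
B0_STAGES = [
    ("Conv",    3, 2,   32, 1),   # stage 1: stem convolution
    ("MBConv1", 3, 1,   16, 1),
    ("MBConv6", 3, 2,   24, 2),
    ("MBConv6", 5, 2,   40, 2),
    ("MBConv6", 3, 2,   80, 3),
    ("MBConv6", 5, 1,  112, 3),
    ("MBConv6", 5, 2,  192, 4),
    ("MBConv6", 3, 1,  320, 1),
    ("Conv1x1", 1, 1, 1280, 1),   # stage 9: head conv (pooling and FC follow)
]


def trace_resolutions(input_size=224):
    """Return (stage, operator, input size, output size, channels, layers) per stage."""
    rows = []
    size = input_size
    for stage, (op, k, stride, out_c, layers) in enumerate(B0_STAGES, start=1):
        in_size = size
        # With padding (k - 1) // 2 and odd k, only the stride changes the size.
        size = (size + stride - 1) // stride
        rows.append((stage, op, in_size, size, out_c, layers))
    return rows


for row in trace_resolutions():
    print(row)
```

Only the first block of a stage applies the stride. The remaining repeats use stride 1, so the size change per stage is applied once.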

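1.1 🎓 The MBConv block

An MBConv block consists of a 1x1 ordinary convolution (expansion, with BN and Swish), a $k \times k$ depthwise convolution (with BN and Swish), where k is 3 or 5 as listed in the EfficientNet-B0 table, an SE module, a 1x1 ordinary convolution (projection, with BN but no activation), and a Dropout layer.

Details:

The first 1x1 expansion convolution has n times as many kernels as the input feature map has channels. The 1 and 6 after MBConv1 and MBConv6 are this expansion factor.
For MBConv6, the channels are expanded by a factor of 6.
When n=1 the first 1x1 expansion convolution is omitted, as in the MBConv block of Stage 2 (similar to MobileNetV3). In the source code this first convolution is simply not built for such blocks.
After the 1x1 expansion, the depthwise convolution must leave the number of channels unchanged: setting the number of groups equal to the number of input channels guarantees that the input and output channel counts match.
The shortcut connection (similar to a residual connection) is used only when the feature map entering the MBConv block has the same shape as the one leaving it.
The Dropout inside MBConv is not the same as the Dropout before the fully connected layer: in the source code it is a DropPath (stochastic depth) layer, and it is present only in MBConv blocks that use the shortcut connection.

Quick tip: how do shortcut and concat connections differ?
A shortcut connection adds the output of an earlier layer to the output of a later layer element-wise. This residual learning makes it easier for the network to learn an identity mapping, which alleviates vanishing and exploding gradients and speeds up training. Because the two tensors are added directly, they must have the same shape.
A concat connection passes information across layers by joining two feature maps along the channel dimension, producing a feature map with more channels. The two feature maps may have different channel counts, but their height and width must match.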
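The shape requirements of the two connection types can be checked directly in PyTorch. This is a minimal sketch with arbitrary tensor sizes chosen for illustration. It is not part of the EfficientNet code.

```python
import torch

x = torch.randn(1, 16, 32, 32)      # input to a block
fx = torch.randn(1, 16, 32, 32)     # block output with the same shape as x

# Shortcut: element-wise addition, so both shapes must be identical.
shortcut_out = x + fx
print(shortcut_out.shape)           # torch.Size([1, 16, 32, 32])

# Concat: joined along the channel dimension (dim=1); the channel counts
# may differ, but batch size, height, and width must match.
skip = torch.randn(1, 24, 32, 32)
concat_out = torch.cat([x, skip], dim=1)
print(concat_out.shape)             # torch.Size([1, 40, 32, 32])
```

EfficientNet's MBConv uses only the shortcut form, which is why the implementation checks that stride is 1 and that input and output channels are equal before adding.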

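1.2 ✨ The SE attention module

The MBConv block described above contains an SE module. The feature map leaving the depthwise convolution enters the SE module and is split into two branches. The first branch keeps the feature map as is. The second branch applies global average pooling, then a first fully connected layer FC1 that reduces the dimension, then a second fully connected layer FC2 that restores it. Finally the outputs of the two branches are multiplied together.

The number of nodes in the first fully connected layer is 1/4 of the channels of the feature map entering the MBConv block, and it uses the Swish activation.
The number of nodes in the second fully connected layer equals the channels of the feature map leaving the depthwise convolution, and it uses the Sigmoid activation.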
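As a rough sketch of the description above, the SE module can be written with two `nn.Linear` layers. The class name `SEBlockSketch` and its argument names are made up for this example. The full implementation in Part II uses 1x1 convolutions instead of fully connected layers, which computes the same thing on a 1x1 feature map.

```python
import torch
import torch.nn as nn


class SEBlockSketch(nn.Module):
    """Illustrative SE module built from fully connected layers (hypothetical name)."""

    def __init__(self, block_in_channels: int, dw_channels: int, squeeze_factor: int = 4):
        super().__init__()
        # FC1 width: 1/4 of the channels entering the MBConv block.
        squeeze_c = block_in_channels // squeeze_factor
        self.fc1 = nn.Linear(dw_channels, squeeze_c)   # reduce dimension
        self.fc2 = nn.Linear(squeeze_c, dw_channels)   # restore dimension
        self.act = nn.SiLU()                           # Swish
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # global average pooling -> (N, C)
        s = self.act(self.fc1(s))
        s = self.gate(self.fc2(s))         # per-channel weights in (0, 1)
        return x * s.view(n, c, 1, 1)      # rescale each channel of the input


# Example: an MBConv6 block with 16 input channels expands to 96 channels.
se = SEBlockSketch(block_in_channels=16, dw_channels=96)
x = torch.randn(2, 96, 8, 8)
out = se(x)
print(out.shape)  # torch.Size([2, 96, 8, 8])
```

Because the Sigmoid output lies between 0 and 1, the module can only scale channels down, never amplify them.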

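1.3 ⭐️ Depthwise separable convolution

Some lightweight networks, such as MobileNet, use depthwise separable convolution, which combines a depthwise (DW) convolution and a pointwise (PW) convolution to extract feature maps. Compared with a standard convolution it needs fewer parameters and less computation. To understand it, it helps to look at standard, depthwise, and pointwise convolution in turn.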

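1.3.1 Standard convolution

Suppose the input is an N×N, 3-channel color image. It passes through a convolution layer with 4 filters and yields 4 feature maps of the same size as the input. The layer has 4 filters, each containing 3 kernels of size 3×3, so its parameter count is $Params_{conv} = 3 \times 3 \times 3 \times 4 = 108$.

Formula 1: the parameter count of a convolution layer is
$params=C_o\times(k_w\times k_h\times C_i+1)$
where $C_i$ is the number of input channels, $k_w$ and $k_h$ are the kernel width and height, $C_o$ is the number of output channels (i.e. the number of filters), and +1 accounts for the bias. When Batch Normalization follows the convolution the bias is not needed and the +1 term is dropped. Kernels are usually square, $k_w = k_h = k$, so without the bias the formula simplifies to
$params=C_o\times k^2\times C_i$
Formula 2: the computation cost of a convolution layer is
$FLOPs=C_i\times k^2\times C_o\times W\times H$
where $C_i$ is the number of input channels, $k$ is the kernel size, $C_o$ is the number of output channels, and W and H are the width and height of the output feature map.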
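Formula 1 and Formula 2 translate directly into code. The helper names `conv_params` and `conv_flops` below are made up for this example, and the 224×224 output size is just an illustrative choice.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Formula 1 with a square kernel: C_o * (k * k * C_i + 1), dropping +1 without bias."""
    return c_out * (k * k * c_in + (1 if bias else 0))


def conv_flops(c_in: int, c_out: int, k: int, out_h: int, out_w: int) -> int:
    """Formula 2: C_i * k^2 * C_o * W * H, with W and H taken from the output feature map."""
    return c_in * k * k * c_out * out_w * out_h


# The running example: 3 input channels, 4 filters, 3x3 kernels.
print(conv_params(3, 4, 3))              # 108
print(conv_params(3, 4, 3, bias=True))   # 112
print(conv_flops(3, 4, 3, 224, 224))     # 5419008
```

Formula 2 counts one multiply-accumulate per weight per output position, so the cost grows with the output area while the parameter count does not.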

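1.3.2 Depthwise convolution

Put simply, a depthwise convolution keeps the depth (channel) dimension unchanged and can only change the feature map size H/W.

Depthwise convolution (DWConv) differs from standard convolution in that each kernel is single-channel: every input channel is convolved with its own kernel, so the output has as many channels as the input. In other words, number of input channels = number of kernels = number of output feature maps.

Take an N×N, 3-channel color image and convolve it with 3 single-channel kernels, one per channel. The result is 3 single-channel feature maps. Each filter contains a single 3×3 kernel, so the parameter count is $Params_{depthwise Conv} = 3 \times 3 \times 3 = 27$ (the first two 3s are the kernel size, the last 3 is the number of filters).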

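1.3.3 Pointwise convolution

Put simply, a pointwise convolution keeps W/H unchanged (the feature map size stays the same) and changes the number of channels.

As noted for depthwise convolution, number of input channels = number of kernels = number of output feature maps. This leaves the output with too few feature maps (equivalently, too few channels: the output can be seen as one feature map with 3 channels), which may limit how much useful information it carries. Each channel is also processed independently, so nothing is mixed across channels. Pointwise convolution addresses this.

Pointwise convolution (PWConv) is simply a convolution with 1x1 kernels, used here to increase the number of channels. GoogLeNet, for example, makes heavy use of $1 \times 1$ kernels in all three of its versions, mainly to reduce dimensionality. In general a $1 \times 1$ convolution increases or reduces the number of channels without changing the feature map size.

The 3 single-channel feature maps from the depthwise convolution are convolved with 4 kernels of size $1 \times 1 \times 3$, giving 4 feature maps. The number of output feature maps is determined by the number of filters. The parameter count is $Params_{pointwise Conv} = 1 \times 1 \times 3 \times 4 = 12$.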

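🔥1.3.4 Depthwise separable convolution

In short, a depthwise separable convolution is a depthwise convolution followed by a pointwise convolution.

The depthwise convolution extracts spatial features and the pointwise convolution extracts channel features. A depthwise separable convolution is a grouped convolution along the channel dimension: each channel is convolved independently, and a 1x1 (pointwise) convolution then aggregates all channels before the output.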
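The two steps can be traced with `nn.Conv2d`, where `groups` equal to the input channel count turns a convolution into a depthwise one. This is a small sketch: the 8×8 input size and `padding=1` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)  # one 3-channel input, N=8 picked arbitrarily

# Depthwise step: groups == in_channels, one 3x3 kernel per channel.
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False)
# Pointwise step: 1x1 kernels that mix channels and set the output channel count.
pointwise = nn.Conv2d(3, 4, kernel_size=1, bias=False)

dw_out = depthwise(x)
pw_out = pointwise(dw_out)

print(dw_out.shape)            # torch.Size([1, 3, 8, 8])  channels unchanged
print(pw_out.shape)            # torch.Size([1, 4, 8, 8])  spatial size unchanged
print(depthwise.weight.shape)  # torch.Size([3, 1, 3, 3])  -> 27 parameters
print(pointwise.weight.shape)  # torch.Size([4, 3, 1, 1])  -> 12 parameters
```

The weight shapes show where the counts 27 and 12 come from: each depthwise filter sees only one input channel, while each pointwise filter sees all three.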

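🚀🚀🚀 The parameter count of a depthwise separable convolution is the sum of the depthwise and pointwise parameter counts. Take an N×N, 3-channel color image: 3 single-channel kernels produce 3 single-channel feature maps (depthwise), and 4 kernels of size $1 \times 1 \times 3$ then produce 4 feature maps (pointwise). The total is $Params_{Depthwise Separable Conv} = 27 + 12 = 39$, compared with $3 \times 3 \times 3 \times 4 = 108$ for the standard convolution in 1.3.1.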
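The same bookkeeping can be written for general channel counts. The following is a sketch using the symbols of Formula 1 with the bias omitted. The subscripts dw and pw are shorthand introduced here for the depthwise and pointwise steps.

```latex
\begin{aligned}
Params_{dw} &= k^2 \times C_i \\
Params_{pw} &= C_i \times C_o \\
\frac{Params_{dw} + Params_{pw}}{Params_{conv}}
  &= \frac{k^2 C_i + C_i C_o}{k^2 C_i C_o}
   = \frac{1}{C_o} + \frac{1}{k^2}
\end{aligned}
```

For the running example ($k=3$, $C_i=3$, $C_o=4$) the ratio is $1/4 + 1/9 = 13/36 \approx 0.36$, which matches $39/108$. With the larger channel counts used in real networks the $1/C_o$ term becomes small and the saving approaches a factor of $k^2$.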

1.3.5 Computing the parameter count of a depthwise separable convolution

📌📌📌 Code

import torch
import torch.nn as nn


# Parameter count of a standard convolution layer
def compute_conv_params(in_channels, out_channels, kernel_size):
    conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, bias=False)
    num_params = sum(p.numel() for p in conv_layer.parameters() if p.requires_grad)
    return num_params


# Parameter count of a depthwise convolution layer (groups == in_channels)
def compute_depthwise_conv_params(in_channels, kernel_size):
    depthwise_conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=kernel_size,
                                     groups=in_channels, bias=False)
    num_params = sum(p.numel() for p in depthwise_conv_layer.parameters() if p.requires_grad)
    return num_params


# Parameter count of a pointwise (1x1) convolution layer
def compute_pointwise_conv_params(in_channels, out_channels):
    pointwise_conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, bias=False)
    num_params = sum(p.numel() for p in pointwise_conv_layer.parameters() if p.requires_grad)
    return num_params


# Number of input channels
in_channels = 3
# Number of output channels
out_channels = 4
# Kernel size
kernel_size = 3


# Parameter count of a depthwise separable convolution layer without bias
def compute_depthwise_separable_conv_params_no_bias(in_channels, out_channels, kernel_size):
    depthwise_conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=kernel_size,
                                     groups=in_channels, bias=False)
    pointwise_conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, bias=False)

    num_depthwise_params = sum(p.numel() for p in depthwise_conv_layer.parameters() if p.requires_grad)
    num_pointwise_params = sum(p.numel() for p in pointwise_conv_layer.parameters() if p.requires_grad)

    total_params = num_depthwise_params + num_pointwise_params
    return total_params


# Parameter count of the standard convolution
conv_params = compute_conv_params(in_channels, out_channels, kernel_size)
print("Standard convolution params:", conv_params)

# Parameter count of the depthwise convolution
depthwise_conv_params = compute_depthwise_conv_params(in_channels, kernel_size)
print("Depthwise convolution params:", depthwise_conv_params)

# Parameter count of the pointwise convolution
pointwise_conv_params = compute_pointwise_conv_params(in_channels, out_channels)
print("Pointwise convolution params:", pointwise_conv_params)

# Parameter count of the depthwise separable convolution without bias
depthwise_separable_conv_params_no_bias = compute_depthwise_separable_conv_params_no_bias(in_channels, out_channels,
                                                                                          kernel_size)
print("Depthwise separable convolution params (no bias):", depthwise_separable_conv_params_no_bias)

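📌📌📌 Output: the script reports 108 parameters for the standard convolution, 27 for the depthwise convolution, 12 for the pointwise convolution, and 39 for the depthwise separable convolution without bias.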

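1.4 🎯 Parameter settings for B0~B7

width coefficient: the multiplier on the channel dimension. Adjusting width means adjusting the number of kernels. For example, the 3x3 convolution in Stage 1 of EfficientNet-B0 uses 32 kernels; in B6 this becomes 32x1.8=57.6, which is rounded to the nearest multiple of 8, i.e. 56. The other stages follow the same rule.
depth coefficient: the multiplier on the depth dimension (applied only to Stage 2 through Stage 8). For example, Stage 7 of EfficientNet-B0 has L=4; in B6 this becomes 4x2.6=10.4, which is rounded up to 11.
drop_connect_rate: the drop rate used inside MBConv.
drop_rate: the drop rate of the Dropout before the final fully connected layer.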
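The two rounding rules can be reproduced in a few lines. The `make_divisible` function below copies the logic of `_make_divisible` from the implementation in Part II. The wrapper names `scale_width` and `scale_depth` are made up for this example.

```python
import math


def make_divisible(ch, divisor=8, min_ch=None):
    """Round ch to the nearest multiple of divisor, never dropping below 90% of ch."""
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def scale_width(channels, width_coefficient):
    """Width scaling: multiply, then round to a multiple of 8."""
    return make_divisible(channels * width_coefficient)


def scale_depth(layers, depth_coefficient):
    """Depth scaling: multiply, then round up."""
    return int(math.ceil(depth_coefficient * layers))


# B6 uses width coefficient 1.8 and depth coefficient 2.6.
print(scale_width(32, 1.8))  # 56
print(scale_depth(4, 2.6))   # 11
```

Width is rounded to the nearest multiple of 8 while depth is always rounded up, so a scaled stage never ends up with fewer layers than in B0.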

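II. 2️⃣ Building EfficientNet in PyTorch

🚀🚀🚀 The implementation consists of the MBConv block, the SE module, drop_path, a convolution module, and the per-block MBConv configuration, which are then assembled into the EfficientNet network. The main modules are commented. The code is as follows.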

import math
import copy
from functools import partial
from collections import OrderedDict
from typing import Optional, Callable

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F


# Round the given channel count to the nearest multiple of 8
def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def drop_path(x, drop_prob: float = 0., training: bool = False):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf

    This function is taken from the rwightman.
    It can be seen here:
    https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py#L140
    """
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf
    """

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)


# Convolution + BN + activation module
class ConvBNActivation(nn.Sequential):
    def __init__(self,
                 input_channel: int,
                 output_channel: int,
                 kernel_size: int = 3,  # kernel size
                 stride: int = 1,
                 groups: int = 1,  # number of groups; for a depthwise conv it equals the channel count so input and output channels stay equal
                 norm_layer: Optional[Callable[..., nn.Module]] = None,  # normalization layer (BN)
                 activation_layer: Optional[Callable[..., nn.Module]] = None):  # activation function
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.SiLU  # alias Swish  (torch>=1.7)

        # super() takes two arguments, the subclass and the instance, to identify whose parent method is called.
        # Here super(ConvBNActivation, self) calls __init__ of nn.Sequential, the parent of ConvBNActivation.
        super(ConvBNActivation, self).__init__(nn.Conv2d(in_channels=input_channel,
                                                         out_channels=output_channel,
                                                         kernel_size=kernel_size,
                                                         stride=stride,
                                                         padding=padding,
                                                         groups=groups,
                                                         bias=False),
                                               norm_layer(output_channel),
                                               activation_layer())


# SE module: attention mechanism
class SqueezeExcitation(nn.Module):
    def __init__(self,
                 input_channel: int,  # block input channel
                 expand_channel: int,  # block expand channel: channels after the first 1x1 expansion conv
                 squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = input_channel // squeeze_factor  # FC1 width is 1/4 of the block input channels
        self.fc1 = nn.Conv2d(expand_channel, squeeze_c, 1)  # squeeze the features
        self.ac1 = nn.SiLU()  # alias Swish
        self.fc2 = nn.Conv2d(squeeze_c, expand_channel, 1)  # expand the features back
        self.ac2 = nn.Sigmoid()

    def forward(self, x: Tensor) -> Tensor:
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))  # global average pooling
        scale = self.fc1(scale)
        scale = self.ac1(scale)
        scale = self.fc2(scale)
        scale = self.ac2(scale)
        return scale * x  # multiply with the input features


# Configuration of each MBConv block
class InvertedResidualConfig:
    # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate
    def __init__(self,
                 kernel: int,  # 3 or 5, the kernel sizes used in the paper
                 input_channel: int,
                 out_channel: int,
                 expanded_ratio: int,  # 1 or 6, expansion factor of the first 1x1 conv
                 stride: int,  # 1 or 2
                 use_se: bool,  # True, since every MBConv block uses an SE module
                 drop_rate: float,  # drop rate
                 index: str,  # 1a, 2a, 2b, ... name of the current MBConv block
                 width_coefficient: float):  # width multiplier of the network
        self.input_c = self.adjust_channels(input_channel, width_coefficient)
        self.kernel = kernel
        self.expanded_c = self.input_c * expanded_ratio
        self.out_c = self.adjust_channels(out_channel, width_coefficient)
        self.use_se = use_se
        self.stride = stride
        self.drop_rate = drop_rate
        self.index = index

    # For B1~B7, the B0 channel counts are multiplied by the width multiplier
    @staticmethod
    def adjust_channels(channels: int, width_coefficient: float):
        return _make_divisible(channels * width_coefficient, 8)


# MBConv block
class InvertedResidual(nn.Module):
    def __init__(self,
                 cnf: InvertedResidualConfig,
                 norm_layer: Callable[..., nn.Module]):
        super(InvertedResidual, self).__init__()

        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        self.use_res_connect = (cnf.stride == 1 and cnf.input_c == cnf.out_c)  # shortcut only when both conditions hold

        layers = OrderedDict()  # empty ordered dict that holds the MBConv layers
        activation_layer = nn.SiLU  # alias Swish

        # 1x1 expansion conv: it exists to expand channels, so when the expansion factor is 1 this layer is not built
        if cnf.expanded_c != cnf.input_c:
            layers.update({"expand_conv": ConvBNActivation(cnf.input_c,
                                                           cnf.expanded_c,
                                                           kernel_size=1,
                                                           norm_layer=norm_layer,
                                                           activation_layer=activation_layer)})

        # depthwise conv (input and output channels must stay equal)
        layers.update({"dwconv": ConvBNActivation(cnf.expanded_c,
                                                  cnf.expanded_c,
                                                  kernel_size=cnf.kernel,
                                                  stride=cnf.stride,
                                                  groups=cnf.expanded_c,  # groups == input channels keeps input and output channels equal
                                                  norm_layer=norm_layer,
                                                  activation_layer=activation_layer)})

        if cnf.use_se:
            layers.update({"se": SqueezeExcitation(cnf.input_c,
                                                   cnf.expanded_c)})

        # project
        layers.update({"project_conv": ConvBNActivation(cnf.expanded_c,
                                                        cnf.out_c,
                                                        kernel_size=1,
                                                        norm_layer=norm_layer,
                                                        activation_layer=nn.Identity)})  # Identity: no activation

        self.block = nn.Sequential(layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

        # the drop layer is used only together with the shortcut connection
        if self.use_res_connect and cnf.drop_rate > 0:
            self.dropout = DropPath(cnf.drop_rate)
        else:
            self.dropout = nn.Identity()

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        result = self.dropout(result)
        if self.use_res_connect:
            result += x

        return result


class EfficientNet(nn.Module):
    def __init__(self,
                 width_coefficient: float,
                 depth_coefficient: float,
                 num_classes: int = 1000,
                 dropout_rate: float = 0.2,  # drop rate of the Dropout before the final fully connected layer
                 drop_connect_rate: float = 0.2,  # drop rate inside MBConv
                 block: Optional[Callable[..., nn.Module]] = None,
                 norm_layer: Optional[Callable[..., nn.Module]] = None
                 ):
        super(EfficientNet, self).__init__()

        # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate, repeats
        default_cnf = [[3, 32, 16, 1, 1, True, drop_connect_rate, 1],
                       [3, 16, 24, 6, 2, True, drop_connect_rate, 2],
                       [5, 24, 40, 6, 2, True, drop_connect_rate, 2],
                       [3, 40, 80, 6, 2, True, drop_connect_rate, 3],
                       [5, 80, 112, 6, 1, True, drop_connect_rate, 3],
                       [5, 112, 192, 6, 2, True, drop_connect_rate, 4],
                       [3, 192, 320, 6, 1, True, drop_connect_rate, 1]]

        def round_repeats(repeats):
            """Round number of repeats based on depth multiplier."""
            return int(math.ceil(depth_coefficient * repeats))

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            norm_layer = partial(nn.BatchNorm2d, eps=1e-3, momentum=0.1)

        adjust_channels = partial(InvertedResidualConfig.adjust_channels,
                                  width_coefficient=width_coefficient)

        # build inverted_residual_setting
        bneck_conf = partial(InvertedResidualConfig,
                             width_coefficient=width_coefficient)

        b = 0
        num_blocks = float(sum(round_repeats(i[-1]) for i in default_cnf))
        inverted_residual_setting = []
        for stage, args in enumerate(default_cnf):
            cnf = copy.copy(args)
            for i in range(round_repeats(cnf.pop(-1))):
                if i > 0:
                    # strides equal 1 except first cnf
                    cnf[-3] = 1  # strides
                    cnf[1] = cnf[2]  # input_channel equal output_channel

                cnf[-1] = args[-2] * b / num_blocks  # update dropout ratio
                index = str(stage + 1) + chr(i + 97)  # 1a, 2a, 2b, ...
                inverted_residual_setting.append(bneck_conf(*cnf, index))
                b += 1

        # create layers
        layers = OrderedDict()

        # first conv
        layers.update({"stem_conv": ConvBNActivation(input_channel=3,
                                                     output_channel=adjust_channels(32),
                                                     kernel_size=3,
                                                     stride=2,
                                                     norm_layer=norm_layer)})

        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.update({cnf.index: block(cnf, norm_layer)})

        # build top
        last_conv_input_c = inverted_residual_setting[-1].out_c
        last_conv_output_c = adjust_channels(1280)
        layers.update({"top": ConvBNActivation(input_channel=last_conv_input_c,
                                               output_channel=last_conv_output_c,
                                               kernel_size=1,
                                               norm_layer=norm_layer)})

        self.features = nn.Sequential(layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)

        classifier = []
        if dropout_rate > 0:
            classifier.append(nn.Dropout(p=dropout_rate, inplace=True))
        classifier.append(nn.Linear(last_conv_output_c, num_classes))
        self.classifier = nn.Sequential(*classifier)

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def efficientnet_b0(num_classes=1000):
    # input image size 224x224
    return EfficientNet(width_coefficient=1.0,
                        depth_coefficient=1.0,
                        dropout_rate=0.2,
                        num_classes=num_classes)


def efficientnet_b1(num_classes=1000):
    # input image size 240x240
    return EfficientNet(width_coefficient=1.0,
                        depth_coefficient=1.1,
                        dropout_rate=0.2,
                        num_classes=num_classes)


def efficientnet_b2(num_classes=1000):
    # input image size 260x260
    return EfficientNet(width_coefficient=1.1,
                        depth_coefficient=1.2,
                        dropout_rate=0.3,
                        num_classes=num_classes)


def efficientnet_b3(num_classes=1000):
    # input image size 300x300
    return EfficientNet(width_coefficient=1.2,
                        depth_coefficient=1.4,
                        dropout_rate=0.3,
                        num_classes=num_classes)


def efficientnet_b4(num_classes=1000):
    # input image size 380x380
    return EfficientNet(width_coefficient=1.4,
                        depth_coefficient=1.8,
                        dropout_rate=0.4,
                        num_classes=num_classes)


def efficientnet_b5(num_classes=1000):
    # input image size 456x456
    return EfficientNet(width_coefficient=1.6,
                        depth_coefficient=2.2,
                        dropout_rate=0.4,
                        num_classes=num_classes)


def efficientnet_b6(num_classes=1000):
    # input image size 528x528
    return EfficientNet(width_coefficient=1.8,
                        depth_coefficient=2.6,
                        dropout_rate=0.5,
                        num_classes=num_classes)


def efficientnet_b7(num_classes=1000):
    # input image size 600x600
    return EfficientNet(width_coefficient=2.0,
                        depth_coefficient=3.1,
                        dropout_rate=0.5,
                        num_classes=num_classes)
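One detail of the constructor above is easy to miss: the line `cnf[-1] = args[-2] * b / num_blocks` gives each MBConv block its own drop rate, growing linearly from 0 for the first block toward `drop_connect_rate` for the last. The standalone sketch below recomputes that schedule. The names `B0_REPEATS` and `block_drop_rates` are made up for this illustration, and the repeat counts are taken from `default_cnf`.

```python
import math

# Repeat counts of the seven MBConv stages in B0 (last column of default_cnf).
B0_REPEATS = [1, 2, 2, 3, 3, 4, 1]


def block_drop_rates(depth_coefficient=1.0, drop_connect_rate=0.2):
    """Per-block drop rates, following drop_connect_rate * b / num_blocks."""
    repeats = [int(math.ceil(depth_coefficient * r)) for r in B0_REPEATS]
    num_blocks = sum(repeats)
    return [drop_connect_rate * b / num_blocks for b in range(num_blocks)]


rates = block_drop_rates()
print(len(rates))           # 16 MBConv blocks in B0
print(rates[0])             # 0.0, the first block is never dropped
print(round(rates[-1], 4))  # 0.1875, just below drop_connect_rate
```

With `depth_coefficient=2.6` (B6) the same function yields 45 blocks, so deeper variants spread the same maximum rate over more blocks.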