【6s965-fall2022】Quantization II


What Is Linear Quantization

$$r = S(q - Z)$$

Here $S$ is the scale factor, usually a floating-point number; $q$ is the quantized representation of $r$ and is an integer; $Z$ is also an integer: the value of $q$ equal to $Z$ is mapped to zero in $r$, so $Z$ is the zero-point offset.
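To make the mapping concrete, here is a tiny numeric sketch; the values of $S$ and $Z$ are made up purely for illustration:

# Hedged example: S = 0.05 and Z = -10 are arbitrary illustration values.
S, Z = 0.05, -10
r = 0.5
q = round(r / S) + Z        # quantize:   round(10) + (-10) = 0
r_rec = S * (q - Z)         # dequantize: 0.05 * (0 + 10) = 0.5
print(q, r_rec)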

How to Determine the Parameters


Let $r_{min}, r_{max}$ be the minimum and maximum of the original floating-point weights, and let $q_{min}, q_{max}$ be the minimum and maximum of the quantized range (typically $-2^{n-1}$ and $2^{n-1}-1$ for some bit width $n$). Then $r_{min} = S(q_{min} - Z)$ and $r_{max} = S(q_{max} - Z)$. Subtracting the first equation from the second gives $r_{max} - r_{min} = S(q_{max} - q_{min})$, i.e.

$$S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$$


For the zero-point offset, we want the quantization scheme to represent zero exactly. From $r_{min} = S(q_{min} - Z)$ we get $Z = q_{min} - \frac{r_{min}}{S}$, but this value is in general not an integer, so we round it:

$$Z = \mathrm{round}\left(q_{min} - \frac{r_{min}}{S}\right)$$
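As a quick sanity check, here is a minimal sketch (with made-up $r_{min}, r_{max}$ values) of computing $S$ and $Z$ for signed 8-bit quantization:

# Hedged sketch with made-up values: compute S and Z for 8-bit signed quantization.
q_min, q_max = -128, 127
r_min, r_max = -1.2, 2.4
S = (r_max - r_min) / (q_max - q_min)   # 3.6 / 255 ≈ 0.0141
Z = round(q_min - r_min / S)            # round(-128 + 85) = -43
print(S, Z)
# Real zero is represented exactly: q = round(0 / S) + Z = Z, and S * (Z - Z) == 0.0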


Asymmetric vs. Symmetric

Asymmetric quantization gets its name from the asymmetry between the positive and negative values it can represent.

In symmetric quantization, $Z$ is fixed to $0$, and the scale factor is $S = \frac{\lvert r \rvert_{max}}{q_{max}}$. This is easier to implement and the handling of zero is cleaner, but part of the quantization range is effectively wasted (that is, there is a set of values the scheme can represent but that are never needed). This is especially true after any ReLU operation, where we know the values will be non-negative, so essentially a full bit of information is lost. In general this means we do not use this scheme to quantize activations, but we can use it to quantize weights. The small sketch below illustrates the wasted range after a ReLU.
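A minimal illustration with made-up activations:

import torch

# Made-up non-negative activations, e.g. the output of a ReLU
act = torch.relu(torch.randn(1000))
scale = act.abs().max().item() / 127         # symmetric int8: Z = 0
q = (act / scale).round().clamp(-128, 127)
print(q.min().item(), q.max().item())        # min is 0: codes -128..-1 are never used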

Quantized Inference

Linear Quantized Matrix Multiplication

$$r = S(q - Z)$$

Here $q$ is the quantized representation of $r$. Substituting this mapping into the matrix multiplication:

$$Y = WX$$
$$S_Y(q_Y - Z_Y) = S_W(q_W - Z_W) \cdot S_X(q_X - Z_X)$$

Solving for $q_Y$ gives the linear quantized matrix multiplication:

$$q_Y = \frac{S_W S_X}{S_Y}(q_W - Z_W)(q_X - Z_X) + Z_Y$$
$$q_Y = \frac{S_W S_X}{S_Y}(q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X) + Z_Y$$

Note that the term $Z_X q_W + Z_W Z_X$ can be precomputed, since it does not depend on the specific input ($Z_X$ depends only on our quantization scheme). Moreover, with symmetric quantization of the weights, $Z_W = 0$, and the expression simplifies to
$$q_Y = \frac{S_W S_X}{S_Y}(q_W q_X - Z_X q_W) + Z_Y$$
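The following is a minimal end-to-end check of this formula with made-up data and per-tensor scales (the implementation later in this post uses per-channel weight scales instead):

import torch

torch.manual_seed(0)
W = torch.randn(4, 8)
X = torch.randn(8, 3)
Y = W @ X

# per-tensor scales / zero points; weights are symmetric (Z_W = 0)
S_W = W.abs().max().item() / 127
S_X = (X.max() - X.min()).item() / 255
Z_X = int(round(-128 - X.min().item() / S_X))
S_Y = (Y.max() - Y.min()).item() / 255
Z_Y = int(round(-128 - Y.min().item() / S_Y))

q_W = (W / S_W).round().to(torch.int32)
q_X = (X / S_X).round().to(torch.int32) + Z_X
q_Y = (S_W * S_X / S_Y) * (q_W @ (q_X - Z_X)).float() + Z_Y

Y_hat = S_Y * (q_Y.round() - Z_Y)     # dequantize to compare with Y
print((Y - Y_hat).abs().max())        # small relative to the spread of Y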


Linear Quantized Fully-Connected Layer

Building on the linear quantized matrix multiplication, add the bias term $b$:

$$Y = WX + b$$
$$S_Y(q_Y - Z_Y) = S_W(q_W - Z_W) \cdot S_X(q_X - Z_X) + S_b(q_b - Z_b)$$

$$q_Y = \frac{S_W S_X}{S_Y}(q_W - Z_W)(q_X - Z_X) + Z_Y + \frac{S_b}{S_Y}(q_b - Z_b)$$

Letting $S_b = S_W S_X$, we can combine like terms to get

$$q_Y = \frac{S_W S_X}{S_Y}(q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X + q_b - Z_b) + Z_Y$$

To keep the computation simple, use symmetric quantization for the weights and the bias, i.e. set $Z_W = 0$ and $Z_b = 0$, which gives

$$q_Y = \frac{S_W S_X}{S_Y}(q_W q_X + q_b - Z_X q_W) + Z_Y$$

Finally, let $q_{\text{bias}} = q_b - Z_X q_W$. Since $q_{\text{bias}}$ can be precomputed, we get

$$q_Y = \frac{S_W S_X}{S_Y}(q_W q_X + q_{\text{bias}}) + Z_Y$$


Linear Quantized Convolution Layer

It turns out that, since convolution is essentially also a linear operator, the derivation of its quantized form is very similar to that of the linear layer. With analogous definitions (i.e., $Z_W = Z_b = 0$ and $S_b = S_W S_X$), we obtain

$$q_Y = \frac{S_W S_X}{S_Y}\left(\mathrm{Conv}(q_W, q_X) + q_{\text{bias}}\right) + Z_Y$$

where $q_{\text{bias}} = q_b - \mathrm{Conv}(q_W, Z_X)$.

Summary

Through the derivations above, we have obtained linear quantized fully-connected and convolution layers. We can therefore linearly quantize every layer of a trained model, replacing each with its linear quantized counterpart, and thus replace the model's floating-point operations with integer operations.

Implementation

Compute the two's-complement range: an $n$-bit signed integer covers $[-2^{n-1}, 2^{n-1}-1]$.

# An *n*-bit signed integer can encode integers in the range $[-2^{n-1}, 2^{n-1}-1]$
def get_quantized_range(bitwidth):
    quantized_max = (1 << (bitwidth - 1)) - 1
    quantized_min = -(1 << (bitwidth - 1))
    return quantized_min, quantized_max
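A quick sanity check of the helper (the bit widths here are just examples):

print(get_quantized_range(8))   # (-128, 127)
print(get_quantized_range(4))   # (-8, 7)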

Implement the linear quantization expression: $q = \mathrm{int}(\mathrm{round}(r/S)) + Z$

def linear_quantize(fp_tensor, bitwidth, scale, zero_point, dtype=torch.int8) -> torch.Tensor:
    """
    linear quantization for single fp_tensor
      from
        fp_tensor = (quantized_tensor - zero_point) * scale
      we have,
        quantized_tensor = int(round(fp_tensor / scale)) + zero_point
    :param fp_tensor: [torch.(cuda.)FloatTensor] floating tensor to be quantized
    :param bitwidth: [int] quantization bit width
    :param scale: [torch.(cuda.)FloatTensor] scaling factor
    :param zero_point: [torch.(cuda.)IntTensor] the desired centroid of tensor values
    :return:
        [torch.(cuda.)Tensor] quantized tensor whose values are integers (dtype defaults to torch.int8)
    """
    assert(fp_tensor.dtype == torch.float)
    assert(isinstance(scale, float) or 
           (scale.dtype == torch.float and scale.dim() == fp_tensor.dim()))
    assert(isinstance(zero_point, int) or 
           (zero_point.dtype == dtype and zero_point.dim() == fp_tensor.dim()))

    # Step 1: scale the fp_tensor
    scaled_tensor = fp_tensor.div(scale)
    # Step 2: round the floating value to integer value
    rounded_tensor = scaled_tensor.round_()
    rounded_tensor = rounded_tensor.to(dtype)

    # Step 3: shift the rounded_tensor to make zero_point 0
    shifted_tensor = rounded_tensor.add_(zero_point)
   
    # Step 4: clamp the shifted_tensor to lie in bitwidth-bit range
    quantized_min, quantized_max = get_quantized_range(bitwidth)
    quantized_tensor = shifted_tensor.clamp_(quantized_min, quantized_max)
    return quantized_tensor

Test linear quantization:
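A hedged test sketch (the tensor, bit width, and scale below are made up and are not the original notebook's test case):

import torch

test_tensor = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])
scale, zero_point = 2.0 / 15, 0     # chosen by hand to roughly cover [-1, 1] with 4 bits
quantized = linear_quantize(test_tensor, bitwidth=4, scale=scale, zero_point=zero_point)
print(quantized)                    # integer codes clamped to the 4-bit range [-8, 7]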
Implement the parameter expressions:

$$S = (r_{max} - r_{min}) / (q_{max} - q_{min})$$
$$Z = \mathrm{int}(\mathrm{round}(q_{min} - r_{min} / S))$$

def get_quantization_scale_and_zero_point(fp_tensor, bitwidth):
    """
    get quantization scale for single tensor
    :param fp_tensor: [torch.(cuda.)Tensor] floating tensor to be quantized
    :param bitwidth: [int] quantization bit width
    :return:
        [float] scale
        [int] zero_point
    """
    quantized_min, quantized_max = get_quantized_range(bitwidth)
    fp_max = fp_tensor.max().item()
    fp_min = fp_tensor.min().item()
    
    scale = (fp_max - fp_min) / (quantized_max - quantized_min)
    zero_point = quantized_min - fp_min / scale

    # clip the zero_point to fall in [quantized_min, quantized_max]
    if zero_point < quantized_min:
        zero_point = quantized_min
    elif zero_point > quantized_max:
        zero_point = quantized_max
    else: # convert from float to int using round()
        zero_point = round(zero_point)
    return scale, int(zero_point)

Wrap the quantization expression and the parameter expressions together in a single function:

def linear_quantize_feature(fp_tensor, bitwidth):
    """
    linear quantization for feature tensor
    :param fp_tensor: [torch.(cuda.)Tensor] floating feature to be quantized
    :param bitwidth: [int] quantization bit width
    :return:
        [torch.(cuda.)Tensor] quantized tensor
        [float] scale
        [int] zero_point
    """
    scale, zero_point = get_quantization_scale_and_zero_point(fp_tensor, bitwidth)
    quantized_tensor = linear_quantize(fp_tensor, bitwidth, scale, zero_point)
    return quantized_tensor, scale, zero_point
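A small round-trip sketch (made-up input) confirming that quantizing and then dequantizing a feature tensor reconstructs it up to an error on the order of the scale:

import torch

x = torch.randn(2, 5)
q_x, scale, zero_point = linear_quantize_feature(x, bitwidth=8)
x_rec = scale * (q_x.float() - zero_point)
print((x - x_rec).abs().max())      # roughly bounded by the scale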

The implementation above is asymmetric quantization. Next, we implement symmetric quantization.

$$Z = 0$$
$$r_{max} = S \cdot q_{max}$$

For symmetric quantization, we only need to compute the scale from the $r_{max}$ expression and then call linear_quantize(tensor, bitwidth, scale, zero_point=0). The following implements the $r_{max}$ expression:

def get_quantization_scale_for_weight(weight, bitwidth):
    """
    get quantization scale for single tensor of weight
    :param weight: [torch.(cuda.)Tensor] floating weight to be quantized
    :param bitwidth: [integer] quantization bit width
    :return:
        [floating scalar] scale
    """
    # we just assume values in weight are symmetric
    # we also always make zero_point 0 for weight
    fp_max = max(weight.abs().max().item(), 5e-7)
    _, quantized_max = get_quantized_range(bitwidth)
    return fp_max / quantized_max
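A hedged usage sketch: symmetric per-tensor quantization of a made-up weight matrix, reusing linear_quantize with zero_point=0:

import torch

w = torch.randn(16, 32)
w_scale = get_quantization_scale_for_weight(w, bitwidth=8)
q_w = linear_quantize(w, 8, w_scale, zero_point=0)
print(q_w.abs().max())              # at most 127 for 8 bits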

Per-Channel Symmetric Linear Quantization

Recall that for 2D convolution, the weight tensor is a four-dimensional tensor of shape (num_output_channels, num_input_channels, kernel_height, kernel_width).

Experiments show that using a different scale $S$ and zero point $Z$ for each output channel performs better. For asymmetric quantization we would therefore have to determine a scale $S$ and a zero point $Z$ for each output channel's sub-tensor; for symmetric quantization only the scale $S$ is needed. The following performs per-channel linear quantization using symmetric quantization; a short usage sketch follows the code.

def linear_quantize_weight_per_channel(tensor, bitwidth):
    """
    linear quantization for weight tensor
        using different scales and zero_points for different output channels
    :param tensor: [torch.(cuda.)Tensor] floating weight to be quantized
    :param bitwidth: [int] quantization bit width
    :return:
        [torch.(cuda.)Tensor] quantized tensor
        [torch.(cuda.)Tensor] scale tensor
        [int] zero point (which is always 0)
    """
    dim_output_channels = 0
    num_output_channels = tensor.shape[dim_output_channels]
    scale = torch.zeros(num_output_channels, device=tensor.device)
    for oc in range(num_output_channels):
        _subtensor = tensor.select(dim_output_channels, oc)
        _scale = get_quantization_scale_for_weight(_subtensor, bitwidth)
        scale[oc] = _scale
    scale_shape = [1] * tensor.dim()
    scale_shape[dim_output_channels] = -1
    scale = scale.view(scale_shape)
    quantized_tensor = linear_quantize(tensor, bitwidth, scale, zero_point=0)
    return quantized_tensor, scale, 0
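A brief usage sketch on a made-up convolution weight of shape (output channels, input channels, kernel height, kernel width):

import torch

w = torch.randn(8, 3, 3, 3)
q_w, w_scale, w_zp = linear_quantize_weight_per_channel(w, bitwidth=8)
print(q_w.shape, w_scale.shape, w_zp)   # torch.Size([8, 3, 3, 3]) torch.Size([8, 1, 1, 1]) 0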

(Figures in the original post show the weight distributions before and after per-channel quantization.)

Quantizing the Fully-Connected Layer

To apply symmetric linear quantization to the bias, we first need $S_{\mathrm{bias}}$:

$$Z_{\mathrm{bias}} = 0$$
$$S_{\mathrm{bias}} = S_{\mathrm{weight}} \cdot S_{\mathrm{input}}$$

def linear_quantize_bias_per_output_channel(bias, weight_scale, input_scale):
    """
    linear quantization for single bias tensor
        quantized_bias = fp_bias / bias_scale
    :param bias: [torch.FloatTensor] bias weight to be quantized
    :param weight_scale: [float or torch.FloatTensor] weight scale tensor
    :param input_scale: [float] input scale
    :return:
        [torch.IntTensor] quantized bias tensor
    """
    assert(bias.dim() == 1)
    assert(bias.dtype == torch.float)
    assert(isinstance(input_scale, float))
    if isinstance(weight_scale, torch.Tensor):
        assert(weight_scale.dtype == torch.float)
        weight_scale = weight_scale.view(-1)
        assert(bias.numel() == weight_scale.numel())

    bias_scale = weight_scale * input_scale
    
    quantized_bias = linear_quantize(bias, 32, bias_scale,
                                     zero_point=0, dtype=torch.int32)
    return quantized_bias, bias_scale, 0

For the quantized fully-connected layer, we also need to precompute $Q_{\mathrm{bias}}$:

$$Q_{\mathrm{bias}} = q_{\mathrm{bias}} - \mathrm{Linear}[Z_{\mathrm{input}}, q_{\mathrm{weight}}]$$

def shift_quantized_linear_bias(quantized_bias, quantized_weight, input_zero_point):
    """
    shift quantized bias to incorporate input_zero_point for nn.Linear
        shifted_quantized_bias = quantized_bias - Linear(input_zero_point, quantized_weight)
    :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)
    :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)
    :param input_zero_point: [int] input zero point
    :return:
        [torch.IntTensor] shifted quantized bias tensor
    """
    assert(quantized_bias.dtype == torch.int32)
    assert(isinstance(input_zero_point, int))
    return quantized_bias - quantized_weight.sum(1).to(torch.int32) * input_zero_point

Next, we can compute the output of the quantized fully-connected layer:

$$q_{\mathrm{output}} = (\mathrm{Linear}[q_{\mathrm{input}}, q_{\mathrm{weight}}] + Q_{\mathrm{bias}}) \cdot (S_{\mathrm{input}} S_{\mathrm{weight}} / S_{\mathrm{output}}) + Z_{\mathrm{output}}$$

def quantized_linear(input, weight, bias, feature_bitwidth, weight_bitwidth,
                     input_zero_point, output_zero_point,
                     input_scale, weight_scale, output_scale):
    """
    quantized fully-connected layer
    :param input: [torch.CharTensor] quantized input (torch.int8)
    :param weight: [torch.CharTensor] quantized weight (torch.int8)
    :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)
    :param feature_bitwidth: [int] quantization bit width of input and output
    :param weight_bitwidth: [int] quantization bit width of weight
    :param input_zero_point: [int] input zero point
    :param output_zero_point: [int] output zero point
    :param input_scale: [float] input feature scale
    :param weight_scale: [torch.FloatTensor] weight per-channel scale
    :param output_scale: [float] output feature scale
    :return:
        [torch.CharTensor] quantized output feature (torch.int8)
    """
    assert(input.dtype == torch.int8)
    assert(weight.dtype == input.dtype)
    assert(bias is None or bias.dtype == torch.int32)
    assert(isinstance(input_zero_point, int))
    assert(isinstance(output_zero_point, int))
    assert(isinstance(input_scale, float))
    assert(isinstance(output_scale, float))
    assert(weight_scale.dtype == torch.float)

    # Step 1: integer-based fully-connected (8-bit multiplication with 32-bit accumulation)
    if 'cpu' in input.device.type:
        # use 32-b MAC for simplicity
        output = torch.nn.functional.linear(input.to(torch.int32), weight.to(torch.int32), bias)
    else:
        # current version pytorch does not yet support integer-based linear() on GPUs
        output = torch.nn.functional.linear(input.float(), weight.float(), bias.float())

    # Step 2: scale the output
    #         hint: 1. scales are floating numbers, we need to convert output to float as well
    #               2. the shape of weight scale is [oc, 1] while the shape of output is [batch_size, oc]
    output = output.float() * (input_scale * weight_scale / output_scale).view(1, -1)

    # Step 3: shift output by output_zero_point
    output = output + output_zero_point

    # Make sure all value lies in the bitwidth-bit range
    output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)
    return output

Test the output of the quantized fully-connected layer:
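Since the original test screenshots are not reproduced here, below is a hedged end-to-end sketch comparing one quantized linear layer against its floating-point reference. The layer shape and input are made up, and the activation ranges are taken directly from this input rather than from calibration data:

import torch

torch.manual_seed(0)
x = torch.randn(4, 16)
fc = torch.nn.Linear(16, 8)
y_ref = fc(x)

feature_bitwidth = weight_bitwidth = 8
input_scale, input_zero_point = get_quantization_scale_and_zero_point(x, feature_bitwidth)
output_scale, output_zero_point = get_quantization_scale_and_zero_point(y_ref, feature_bitwidth)

quantized_weight, weight_scale, _ = linear_quantize_weight_per_channel(fc.weight.data, weight_bitwidth)
quantized_bias, _, _ = linear_quantize_bias_per_output_channel(fc.bias.data, weight_scale, input_scale)
shifted_quantized_bias = shift_quantized_linear_bias(quantized_bias, quantized_weight, input_zero_point)

quantized_input = linear_quantize(x, feature_bitwidth, input_scale, input_zero_point)
quantized_output = quantized_linear(quantized_input, quantized_weight, shifted_quantized_bias,
                                    feature_bitwidth, weight_bitwidth,
                                    input_zero_point, output_zero_point,
                                    input_scale, weight_scale, output_scale)
y_hat = output_scale * (quantized_output.float() - output_zero_point)
print((y_ref - y_hat).abs().max())   # the quantization error should be small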

Quantizing the Convolution Layer

For the quantized convolution layer, we likewise precompute $Q_{\mathrm{bias}}$:

$$Q_{\mathrm{bias}} = q_{\mathrm{bias}} - \mathrm{CONV}[Z_{\mathrm{input}}, q_{\mathrm{weight}}]$$

def shift_quantized_conv2d_bias(quantized_bias, quantized_weight, input_zero_point):
    """
    shift quantized bias to incorporate input_zero_point for nn.Conv2d
        shifted_quantized_bias = quantized_bias - Conv(input_zero_point, quantized_weight)
    :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)
    :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)
    :param input_zero_point: [int] input zero point
    :return:
        [torch.IntTensor] shifted quantized bias tensor
    """
    assert(quantized_bias.dtype == torch.int32)
    assert(isinstance(input_zero_point, int))
    return quantized_bias - quantized_weight.sum((1,2,3)).to(torch.int32) * input_zero_point

Then we can compute the output of the quantized convolution layer:

$$q_{\mathrm{output}} = (\mathrm{CONV}[q_{\mathrm{input}}, q_{\mathrm{weight}}] + Q_{\mathrm{bias}}) \cdot (S_{\mathrm{input}} S_{\mathrm{weight}} / S_{\mathrm{output}}) + Z_{\mathrm{output}}$$

def quantized_conv2d(input, weight, bias, feature_bitwidth, weight_bitwidth,
                     input_zero_point, output_zero_point,
                     input_scale, weight_scale, output_scale,
                     stride, padding, dilation, groups):
    """
    quantized 2d convolution
    :param input: [torch.CharTensor] quantized input (torch.int8)
    :param weight: [torch.CharTensor] quantized weight (torch.int8)
    :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)
    :param feature_bitwidth: [int] quantization bit width of input and output
    :param weight_bitwidth: [int] quantization bit width of weight
    :param input_zero_point: [int] input zero point
    :param output_zero_point: [int] output zero point
    :param input_scale: [float] input feature scale
    :param weight_scale: [torch.FloatTensor] weight per-channel scale
    :param output_scale: [float] output feature scale
    :return:
        [torch.(cuda.)CharTensor] quantized output feature
    """
    assert(len(padding) == 4)
    assert(input.dtype == torch.int8)
    assert(weight.dtype == input.dtype)
    assert(bias is None or bias.dtype == torch.int32)
    assert(isinstance(input_zero_point, int))
    assert(isinstance(output_zero_point, int))
    assert(isinstance(input_scale, float))
    assert(isinstance(output_scale, float))
    assert(weight_scale.dtype == torch.float)

    # Step 1: calculate integer-based 2d convolution (8-bit multiplication with 32-bit accumulation)
    input = torch.nn.functional.pad(input, padding, 'constant', input_zero_point)
    if 'cpu' in input.device.type:
        # use 32-b MAC for simplicity
        output = torch.nn.functional.conv2d(input.to(torch.int32), weight.to(torch.int32), None, stride, 0, dilation, groups)
    else:
        # current version pytorch does not yet support integer-based conv2d() on GPUs
        output = torch.nn.functional.conv2d(input.float(), weight.float(), None, stride, 0, dilation, groups)
        output = output.round().to(torch.int32)
    if bias is not None:
        output = output + bias.view(1, -1, 1, 1)

    # Step 2: scale the output
    #         hint: 1. scales are floating numbers, we need to convert output to float as well
    #               2. the shape of weight scale is [oc, 1, 1, 1] while the shape of output is [batch_size, oc, height, width]
    output = output.float() * (input_scale * weight_scale / output_scale).view(1, -1, 1, 1)

    # Step 3: shift output by output_zero_point
    #         hint: one line of code
    output = output + output_zero_point

    # Make sure all value lies in the bitwidth-bit range
    output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)
    return output
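As with the fully-connected layer, here is a hedged sketch (made-up shapes and data; note that padding must be passed as a 4-tuple, as required by the assertion above) comparing one quantized convolution against its floating-point reference:

import torch

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
y_ref = conv(x)

input_scale, input_zero_point = get_quantization_scale_and_zero_point(x, 8)
output_scale, output_zero_point = get_quantization_scale_and_zero_point(y_ref, 8)
quantized_weight, weight_scale, _ = linear_quantize_weight_per_channel(conv.weight.data, 8)
quantized_bias, _, _ = linear_quantize_bias_per_output_channel(conv.bias.data, weight_scale, input_scale)
shifted_quantized_bias = shift_quantized_conv2d_bias(quantized_bias, quantized_weight, input_zero_point)

quantized_input = linear_quantize(x, 8, input_scale, input_zero_point)
quantized_output = quantized_conv2d(quantized_input, quantized_weight, shifted_quantized_bias,
                                    8, 8, input_zero_point, output_zero_point,
                                    input_scale, weight_scale, output_scale,
                                    stride=1, padding=(1, 1, 1, 1), dilation=1, groups=1)
y_hat = output_scale * (quantized_output.float() - output_zero_point)
print((y_ref - y_hat).abs().max())   # the quantization error should be small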

Linearly Quantizing the Model

We can now create QuantizedConv2d, QuantizedLinear, QuantizedMaxPool2d, and QuantizedAvgPool2d modules and use these classes to linearly quantize the model.

class QuantizedConv2d(nn.Module):
    def __init__(self, weight, bias, 
                 input_zero_point, output_zero_point,
                 input_scale, weight_scale, output_scale,
                 stride, padding, dilation, groups,
                 feature_bitwidth=8, weight_bitwidth=8):
        super().__init__()
        # current version Pytorch does not support IntTensor as nn.Parameter
        self.register_buffer('weight', weight)
        self.register_buffer('bias', bias)

        self.input_zero_point = input_zero_point
        self.output_zero_point = output_zero_point

        self.input_scale = input_scale
        self.register_buffer('weight_scale', weight_scale)
        self.output_scale = output_scale

        self.stride = stride
        self.padding = (padding[1], padding[1], padding[0], padding[0])
        self.dilation = dilation
        self.groups = groups

        self.feature_bitwidth = feature_bitwidth
        self.weight_bitwidth = weight_bitwidth


    def forward(self, x):
        return quantized_conv2d(
            x, self.weight, self.bias, 
            self.feature_bitwidth, self.weight_bitwidth,
            self.input_zero_point, self.output_zero_point,
            self.input_scale, self.weight_scale, self.output_scale,
            self.stride, self.padding, self.dilation, self.groups
            )
        
class QuantizedLinear(nn.Module):
    def __init__(self, weight, bias, 
                 input_zero_point, output_zero_point,
                 input_scale, weight_scale, output_scale,
                 feature_bitwidth=8, weight_bitwidth=8):
        super().__init__()
        # current version Pytorch does not support IntTensor as nn.Parameter
        self.register_buffer('weight', weight)
        self.register_buffer('bias', bias)

        self.input_zero_point = input_zero_point
        self.output_zero_point = output_zero_point

        self.input_scale = input_scale
        self.register_buffer('weight_scale', weight_scale)
        self.output_scale = output_scale

        self.feature_bitwidth = feature_bitwidth
        self.weight_bitwidth = weight_bitwidth

    def forward(self, x):
        return quantized_linear(
            x, self.weight, self.bias, 
            self.feature_bitwidth, self.weight_bitwidth,
            self.input_zero_point, self.output_zero_point,
            self.input_scale, self.weight_scale, self.output_scale
            )

class QuantizedMaxPool2d(nn.MaxPool2d):
    def forward(self, x):
        # current version PyTorch does not support integer-based MaxPool
        return super().forward(x.float()).to(torch.int8)

class QuantizedAvgPool2d(nn.AvgPool2d):
    def forward(self, x):
        # current version PyTorch does not support integer-based AvgPool
        return super().forward(x.float()).to(torch.int8)

Now, linearly quantize the model:

# we use int8 quantization, which is quite popular
feature_bitwidth = weight_bitwidth = 8 
quantized_model = copy.deepcopy(model_fused)
quantized_backbone = []
ptr = 0
while ptr < len(quantized_model.backbone):
    if isinstance(quantized_model.backbone[ptr], nn.Conv2d) and \
        isinstance(quantized_model.backbone[ptr + 1], nn.ReLU):
        conv = quantized_model.backbone[ptr]
        conv_name = f'backbone.{ptr}'
        relu = quantized_model.backbone[ptr + 1]
        relu_name = f'backbone.{ptr + 1}'

        input_scale, input_zero_point = \
            get_quantization_scale_and_zero_point(
                input_activation[conv_name], feature_bitwidth)
        
        output_scale, output_zero_point = \
            get_quantization_scale_and_zero_point(
                output_activation[relu_name], feature_bitwidth)

        quantized_weight, weight_scale, weight_zero_point = \
            linear_quantize_weight_per_channel(conv.weight.data, weight_bitwidth)
        quantized_bias, bias_scale, bias_zero_point = \
            linear_quantize_bias_per_output_channel(
                conv.bias.data, weight_scale, input_scale)
        shifted_quantized_bias = \
            shift_quantized_conv2d_bias(quantized_bias, quantized_weight, 
                                        input_zero_point)
            
        quantized_conv = QuantizedConv2d(
            quantized_weight, shifted_quantized_bias,
            input_zero_point, output_zero_point,
            input_scale, weight_scale, output_scale,
            conv.stride, conv.padding, conv.dilation, conv.groups,
            feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth
        )

        quantized_backbone.append(quantized_conv)
        ptr += 2
    elif isinstance(quantized_model.backbone[ptr], nn.MaxPool2d):
        quantized_backbone.append(QuantizedMaxPool2d(
            kernel_size=quantized_model.backbone[ptr].kernel_size,
            stride=quantized_model.backbone[ptr].stride
            ))
        ptr += 1
    elif isinstance(quantized_model.backbone[ptr], nn.AvgPool2d):
        quantized_backbone.append(QuantizedAvgPool2d(
            kernel_size=quantized_model.backbone[ptr].kernel_size,
            stride=quantized_model.backbone[ptr].stride
            ))
        ptr += 1
    else:
        raise NotImplementedError(type(quantized_model.backbone[ptr]))  # should not happen
quantized_model.backbone = nn.Sequential(*quantized_backbone)

# finally, quantize the classifier
fc_name = 'classifier'
fc = model.classifier
input_scale, input_zero_point = \
    get_quantization_scale_and_zero_point(
        input_activation[fc_name], feature_bitwidth)

output_scale, output_zero_point = \
    get_quantization_scale_and_zero_point(
        output_activation[fc_name], feature_bitwidth)

quantized_weight, weight_scale, weight_zero_point = \
    linear_quantize_weight_per_channel(fc.weight.data, weight_bitwidth)
quantized_bias, bias_scale, bias_zero_point = \
    linear_quantize_bias_per_output_channel(
        fc.bias.data, weight_scale, input_scale)
shifted_quantized_bias = \
    shift_quantized_linear_bias(quantized_bias, quantized_weight, 
                                input_zero_point)
            
quantized_model.classifier = QuantizedLinear(
    quantized_weight, shifted_quantized_bias,
    input_zero_point, output_zero_point,
    input_scale, weight_scale, output_scale,
    feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth
)

Finally, compute the accuracy of the quantized model (the original post reports the evaluation results as screenshots).
