[PyTorch Lecture 4] Tricks for Image Classification

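1. Label Smoothing

In classification, the last layer is usually a fully connected layer whose outputs are matched against a one-hot encoding of the label: the target class is encoded as 1 and every other class as 0. Combining this encoding with parameter updates that minimize the cross-entropy loss causes a problem. It encourages the model to push the output scores of different classes far apart, or in other words, to be overconfident in its predictions. In a dataset annotated by several people, however, annotators may follow different criteria, and each of them may make mistakes. Trusting the labels too much therefore leads to overfitting.

Label smoothing regularization (LSR) is one effective remedy. The idea is to reduce how much we trust the labels: for example, lower the target value from 1 to 0.9 and raise the targets from 0 to 0.1. In the implementation below, with smoothing factor e and K classes, the true class receives 1 - e + e/K and every other class receives e/K.

Overall, LSR is a regularization method that constrains the model by adding noise to the labels y, which reduces overfitting.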

import torch
import torch.nn as nn


class LSR(nn.Module):
    """Cross-entropy loss with label smoothing regularization."""

    def __init__(self, e=0.1, reduction='mean'):
        super().__init__()

        self.log_softmax = nn.LogSoftmax(dim=1)
        self.e = e
        self.reduction = reduction

    def _one_hot(self, labels, classes, value=1):
        """Convert labels to one-hot vectors.

        Args:
            labels: torch tensor in format [label1, label2, label3, ...]
            classes: int, number of classes
            value: label value in one hot vector, default to 1

        Returns:
            one-hot labels in shape [batchsize, classes]
        """
        one_hot = torch.zeros(labels.size(0), classes, device=labels.device)

        # labels and value_added must have matching sizes
        labels = labels.view(labels.size(0), -1)
        value_added = torch.full((labels.size(0), 1), float(value), device=labels.device)

        one_hot.scatter_add_(1, labels, value_added)

        return one_hot

    def _smooth_label(self, target, length, smooth_factor):
        """Convert targets to one-hot format and smooth them.

        Args:
            target: targets in the form [label1, label2, ..., label_batchsize]
            length: length of the one-hot format (number of classes)
            smooth_factor: smoothing factor for label smoothing

        Returns:
            smoothed labels in one-hot format
        """
        one_hot = self._one_hot(target, length, value=1 - smooth_factor)
        one_hot += smooth_factor / length

        return one_hot.to(target.device)

    def forward(self, x, target):
        # Assumed completion: the source listing ends before forward().
        # x: logits of shape [batchsize, classes]; target: class indices.
        smoothed_target = self._smooth_label(target, x.size(1), self.e)
        loss = torch.sum(-self.log_softmax(x) * smoothed_target, dim=1)

        if self.reduction == 'sum':
            return loss.sum()
        if self.reduction == 'mean':
            return loss.mean()
        return loss  # reduction == 'none'
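As a sanity check on the smoothing rule described in this section, the sketch below builds the smoothed targets by hand and compares the resulting loss with the `label_smoothing` argument of `torch.nn.functional.cross_entropy` (available in PyTorch 1.10 and later). The helper name `smooth_one_hot`, the batch size, and the class count are illustrative assumptions, not part of the original code.

```python
import torch
import torch.nn.functional as F


def smooth_one_hot(target, num_classes, eps=0.1):
    """True class gets 1 - eps + eps/K, every other class gets eps/K."""
    one_hot = F.one_hot(target, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes


torch.manual_seed(0)
logits = torch.randn(4, 5)              # hypothetical batch: 4 samples, 5 classes
target = torch.tensor([0, 2, 1, 4])

smoothed = smooth_one_hot(target, num_classes=5, eps=0.1)

# Cross-entropy against the smoothed distribution, averaged over the batch.
manual_loss = torch.sum(-smoothed * F.log_softmax(logits, dim=1), dim=1).mean()

# Built-in label smoothing, which uses the same mixing rule.
builtin_loss = F.cross_entropy(logits, target, label_smoothing=0.1)

print(smoothed[0])
print(manual_loss.item(), builtin_loss.item())
```

If the two printed losses agree, the hand-written smoothing and the built-in option implement the same target distribution, so the built-in argument may be a convenient alternative when a custom module is not needed.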

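2. Random Cropping and Patching

Random Image Cropping and Patching (RICAP) is a data augmentation technique widely used in computer vision, particularly when training deep learning models. It randomly crops regions from images and rearranges or patches those regions together to produce visually diverse training samples. Its main purposes are: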

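1. Increase data diversity

Data diversity matters when training deep learning models. By creating visually different samples, random cropping and patching increases the diversity of the training data. This helps the model learn features that generalize better, so it predicts more accurately on unseen data.

2. Reduce overfitting

Overfitting is a common problem in deep learning: the model fits the training data so closely that it loses the ability to generalize and performs poorly on new data. With random cropping and patching, the images fed to the model differ in every epoch, which reduces the model's dependence on the details of specific images and lowers the risk of overfitting.

3. Improve model robustness

By randomly changing how images look (position, size, and context), random cropping and patching forces the model to learn more robust features that are less sensitive to the exact layout and appearance of an image. The model is then more likely to recognize images that are arranged or cropped differently.

4. Support inputs of different sizes

In some cases the model has to handle input images of different sizes. Random cropping and patching simulates this to some extent, because objects appear at varying scales inside the crops, and this can reduce the reliance on a fixed, standardized resizing step.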

Code: cropping

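import torch
from torchvision import transforms
from PIL import Image

# Define a transform that includes a random crop and a random horizontal flip
transform = transforms.Compose([
    # Crop a region of random area and aspect ratio, then resize it to 224x224
    transforms.RandomResizedCrop(size=(224, 224)),
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    # More transforms can be added here, e.g. normalization after ToTensor
    transforms.ToTensor()
])

# Open the image (convert to RGB so the tensor always has 3 channels)
image = Image.open("path/to/your/image.jpg").convert("RGB")

# Apply the transform
transformed_image = transform(image)

# transforms.ToTensor() converts the PIL image to a (C, H, W) tensor,
# so we use a library such as matplotlib to visualize the result.
import matplotlib.pyplot as plt

plt.imshow(transformed_image.permute(1, 2, 0))  # (C, H, W) -> (H, W, C)
plt.show()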

RICAP code:

import numpy as np
import torch

beta = 0.3  # hyperparameter of the Beta distribution

# model, criterion, accuracy and train_loader are assumed to be defined elsewhere
device = next(model.parameters()).device

for (images, targets) in train_loader:

    # get the image size
    I_x, I_y = images.size()[2:]

    # draw a boundary position (w, h)
    w = int(np.round(I_x * np.random.beta(beta, beta)))
    h = int(np.round(I_y * np.random.beta(beta, beta)))
    w_ = [w, I_x - w, w, I_x - w]
    h_ = [h, h, I_y - h, I_y - h]

    # select and crop four images
    cropped_images = {}
    c_ = {}  # labels of each crop
    W_ = {}  # area fraction of each crop, used as its loss weight
    for k in range(4):
        index = torch.randperm(images.size(0))
        x_k = np.random.randint(0, I_x - w_[k] + 1)
        y_k = np.random.randint(0, I_y - h_[k] + 1)
        cropped_images[k] = images[index][:, :, x_k:x_k + w_[k], y_k:y_k + h_[k]]
        c_[k] = targets[index].to(device)
        W_[k] = w_[k] * h_[k] / (I_x * I_y)

    # patch cropped images
    patched_images = torch.cat(
        (torch.cat((cropped_images[0], cropped_images[1]), 2),
         torch.cat((cropped_images[2], cropped_images[3]), 2)),
        3)
    patched_images = patched_images.to(device)

    # get output
    output = model(patched_images)

    # calculate loss and accuracy, weighted by the area of each crop
    loss = sum([W_[k] * criterion(output, c_[k]) for k in range(4)])
    acc = sum([W_[k] * accuracy(output, c_[k])[0] for k in range(4)])
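The patching step in the loop above can also be isolated into a function and checked on random data. The sketch below is one possible arrangement, assuming a batch in (N, C, H, W) layout; the function name `ricap_batch` and the use of Python's `random.betavariate` in place of NumPy are choices made for this illustration.

```python
import random
import torch


def ricap_batch(images, targets, beta=0.3):
    """Patch four random crops into one image per sample (RICAP-style sketch)."""
    n, _, height, width = images.shape

    # Boundary position drawn from Beta(beta, beta).
    h = int(round(height * random.betavariate(beta, beta)))
    w = int(round(width * random.betavariate(beta, beta)))
    hs = [h, h, height - h, height - h]   # crop heights: top-left, top-right, bottom-left, bottom-right
    ws = [w, width - w, w, width - w]     # crop widths in the same order

    crops, labels, weights = [], [], []
    for k in range(4):
        index = torch.randperm(n)                      # shuffle the batch for each patch
        top = random.randint(0, height - hs[k])        # randint includes the upper bound
        left = random.randint(0, width - ws[k])
        crops.append(images[index][:, :, top:top + hs[k], left:left + ws[k]])
        labels.append(targets[index])
        weights.append(hs[k] * ws[k] / (height * width))  # area fraction of this patch

    top_row = torch.cat((crops[0], crops[1]), dim=3)      # side by side
    bottom_row = torch.cat((crops[2], crops[3]), dim=3)
    patched = torch.cat((top_row, bottom_row), dim=2)     # stacked vertically
    return patched, labels, weights


images = torch.rand(8, 3, 32, 32)          # hypothetical batch
targets = torch.randint(0, 10, (8,))
patched, labels, weights = ricap_batch(images, targets)
print(patched.shape, weights)
```

Because the four area fractions sum to 1, the weighted loss stays on the same scale as an ordinary cross-entropy loss.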

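3. Knowledge Distillation

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and average their predictions. Predicting with a whole ensemble is cumbersome, though, and may be too computationally expensive to deploy to a large number of users. Knowledge distillation is one effective way to address this.

In knowledge distillation, a teacher model helps train the current model (the student model). The teacher is a pretrained model with higher accuracy, so the student can improve its accuracy while its own complexity stays unchanged. For example, ResNet-152 can serve as the teacher that helps train a ResNet-50 student. During training, a distillation loss is added to penalize the difference between the outputs of the student and the teacher.

Given an input, let p be the true probability distribution, and let z and r be the outputs of the last fully connected layer of the student and the teacher, respectively. Previously we used the cross-entropy loss \ell(p, \mathrm{softmax}(z)) to measure the difference between p and z; the distillation loss also uses cross-entropy. With a temperature hyperparameter T that softens the softmax outputs, the total loss for knowledge distillation is

$$\ell(p, \mathrm{softmax}(z)) + T^2 \, \ell(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T))$$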

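# Install the dependencies first (in a notebook: !pip install torch torchvision)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Define the student model
class StudentModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # 64 * 56 * 56 assumes 112x112 inputs: the single 2x2 pooling halves 112 to 56
        self.fc1 = nn.Linear(64 * 56 * 56, 128)
        self.fc2 = nn.Linear(128, 10)  # assuming a 10-class task

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Use a pretrained ResNet as the teacher model
# (weights= replaces the deprecated pretrained=True in torchvision >= 0.13)
teacher_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# The ImageNet head outputs 1000 classes, but the distillation loss needs teacher
# and student logits of the same shape. Assumption: the head is replaced with a
# 10-class layer and the teacher is fine-tuned on the target task before distillation.
teacher_model.fc = nn.Linear(teacher_model.fc.in_features, 10)
teacher_model.eval()  # the teacher is only used for inference

# Initialize the student model
student_model = StudentModel()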

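def distillation_loss(y_student, y_teacher, T):
    """
    Compute the distillation loss.
    :param y_student: logits of the student model
    :param y_teacher: logits of the teacher model
    :param T: temperature
    :return: distillation loss
    """
    # KL divergence between the softened distributions; the T * T factor keeps
    # the gradient magnitude comparable across temperatures.
    loss = F.kl_div(F.log_softmax(y_student / T, dim=1),
                    F.softmax(y_teacher / T, dim=1),
                    reduction='batchmean') * (T * T)
    return loss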
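To see what the temperature T does inside the distillation loss, the sketch below applies a temperature-scaled softmax to a made-up logit vector using only the standard library. The logits and the temperature values are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 0.5]  # hypothetical teacher outputs for 3 classes

for T in (1.0, 5.0, 20.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(T, [round(p, 3) for p in probs])
```

A higher temperature flattens the distribution, so the student also receives signal about how the teacher ranks the non-target classes, not only about the top prediction.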




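import torch.optim as optim

# Define the optimizer and the loss function
optimizer = optim.SGD(student_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Assume we have a data loader
# trainloader = ...

# Train the student model
for epoch in range(10):  # loop over the dataset multiple times
    for inputs, labels in trainloader:
        optimizer.zero_grad()

        # Get the outputs of the teacher and the student
        with torch.no_grad():
            outputs_teacher = teacher_model(inputs)
        outputs_student = student_model(inputs)

        # Compute the distillation loss and the classification loss
        loss_distillation = distillation_loss(outputs_student, outputs_teacher, T=20)
        loss_classification = criterion(outputs_student, labels)

        # The total loss is the sum of the distillation and classification losses.
        # Assumed completion: the source listing ends at this point.
        loss = loss_classification + loss_distillation
        loss.backward()
        optimizer.step()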

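4. Cutout

Cutout is a regularization method. During training it randomly masks out part of each image, which improves the robustness of the model. It is motivated by the object occlusion problem that is common in computer vision tasks. By generating samples that resemble partially occluded objects, cutout not only helps the model handle occlusion but also pushes it to take more of the surrounding context into account when making decisions.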

import torch
import numpy as np

class Cutout(object):
    """Randomly mask out one or more patches from an image.
    Args:
        n_holes (int): Number of patches to cut out of each image.
        length (int): The length (in pixels) of each square patch.
    """
    def __init__(self, n_holes, length):
        self.n_holes = n_holes
        self.length = length
 
    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (C, H, W).
        Returns:
            Tensor: Image with n_holes of dimension length x length cut out of it.
        """
        h = img.size(1)
        w = img.size(2)
        
        mask = np.ones((h, w), np.float32)
        
        for n in range(self.n_holes):
            y = np.random.randint(h)
            x = np.random.randint(w)
            
            # Clip the square to the image borders
            y1 = np.clip(y - self.length // 2, 0, h)
            y2 = np.clip(y + self.length // 2, 0, h)
            x1 = np.clip(x - self.length // 2, 0, w)
            x2 = np.clip(x + self.length // 2, 0, w)

            mask[y1:y2, x1:x2] = 0.
            
        mask = torch.from_numpy(mask)
        mask = mask.expand_as(img)
        img = img * mask
        
        return img
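The same masking idea can be written as a small function that uses only torch, which makes it easy to check how many pixels are removed. This is a sketch for a single hole; the function name `cutout` and the test image are assumptions for illustration.

```python
import torch


def cutout(img, length, generator=None):
    """Zero out one length x length square (clipped at the borders) in a (C, H, W) tensor."""
    _, h, w = img.shape
    # Random center of the square.
    y = int(torch.randint(h, (1,), generator=generator))
    x = int(torch.randint(w, (1,), generator=generator))

    # Clip the square to the image borders.
    y1, y2 = max(y - length // 2, 0), min(y + length // 2, h)
    x1, x2 = max(x - length // 2, 0), min(x + length // 2, w)

    out = img.clone()            # leave the input untouched
    out[:, y1:y2, x1:x2] = 0.0   # the same region is masked in every channel
    return out


img = torch.ones(3, 32, 32)      # hypothetical all-ones image
masked = cutout(img, length=8)
zeroed = int((masked[0] == 0).sum())
print(masked.shape, zeroed)
```

Because the square is centered on a random pixel and clipped at the borders, the masked area is smaller than length x length when the center falls near an edge.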

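5. Mixup Training

Mixup training is a data augmentation technique that improves the generalization and performance of deep learning models. It was proposed by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz in the paper "mixup: Beyond Empirical Risk Minimization". Mixup generates new training samples by linearly interpolating both images and labels, which increases the diversity of the training data.

Given two samples (x_i, y_i) and (x_j, y_j) drawn at random from the training data, mixup builds a new sample

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda) y_j$$

where \lambda \in [0, 1] is sampled from Beta(\alpha, \alpha). Only (\hat{x}, \hat{y}) is used during training. Mixup encourages the network to behave linearly between training samples, which improves generalization, but it needs a longer training time to converge well.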

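Implementation 1:

import numpy as np
import torch

# model, criterion, accuracy, train_loader and mixup_alpha are assumed to be defined elsewhere
for (images, labels) in train_loader:

    l = np.random.beta(mixup_alpha, mixup_alpha)  # mixing coefficient lambda

    # pair every sample with a randomly chosen sample from the same batch
    index = torch.randperm(images.size(0))
    images_a, images_b = images, images[index]
    labels_a, labels_b = labels, labels[index]

    mixed_images = l * images_a + (1 - l) * images_b

    outputs = model(mixed_images)
    loss = l * criterion(outputs, labels_a) + (1 - l) * criterion(outputs, labels_b)
    acc = l * accuracy(outputs, labels_a)[0] + (1 - l) * accuracy(outputs, labels_b)[0]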

Implementation 2:

for i, (images, target) in enumerate(train_loader):
    # 1. input and output
    images = images.cuda(non_blocking=True)
    # float targets suit losses such as nn.BCEWithLogitsLoss; for nn.CrossEntropyLoss
    # with class-index targets, use .long() instead of .float()
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)

    # 2. mixup
    alpha = config.alpha
    lam = np.random.beta(alpha, alpha)
    # randperm returns a random permutation of 0 .. images.size(0) - 1
    index = torch.randperm(images.size(0)).cuda()
    inputs = lam * images + (1 - lam) * images[index, :]
    targets_a, targets_b = target, target[index]
    outputs = model(inputs)
    loss = lam * criterion(outputs, targets_a) + (1 - lam) * criterion(outputs, targets_b)

    # 3. backward
    optimizer.zero_grad()   # reset gradients
    loss.backward()
    optimizer.step()        # update the parameters
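Both implementations compute the loss as a weighted sum of two cross-entropy terms instead of building the mixed label explicitly. The sketch below checks, on random logits, that this is the same as the cross-entropy against the interpolated one-hot label. The batch size, class count, and the fixed value of `lam` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 4
logits = torch.randn(6, num_classes)                  # hypothetical model outputs
labels_a = torch.randint(0, num_classes, (6,))
labels_b = labels_a[torch.randperm(6)]                # labels of the shuffled batch
lam = 0.7  # in training this would be drawn from Beta(alpha, alpha)

# Form used in the training loops: weighted sum of two cross-entropy terms.
two_term = (lam * F.cross_entropy(logits, labels_a)
            + (1 - lam) * F.cross_entropy(logits, labels_b))

# Explicit mixed label, then cross-entropy against that soft distribution.
y_hat = (lam * F.one_hot(labels_a, num_classes).float()
         + (1 - lam) * F.one_hot(labels_b, num_classes).float())
soft_label = torch.sum(-y_hat * F.log_softmax(logits, dim=1), dim=1).mean()

print(two_term.item(), soft_label.item())
```

The two values match because cross-entropy is linear in the target distribution, which is why the loops never need to materialize the mixed label.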

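6. AdaBound

AdaBound is an optimization algorithm that aims to combine the fast convergence of Adam with the stability of SGD late in training. It dynamically bounds the learning rate at every step so that the optimizer gradually transitions from Adam to SGD. The key idea is to clip each parameter's learning rate between a dynamic lower bound and a dynamic upper bound, so that it can become neither too large nor too small. AdaBound was proposed by Luo et al. in the paper "Adaptive Gradient Methods with Dynamic Bound of Learning Rate".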
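The dynamic bounds can be illustrated with a few lines of plain Python. The bound functions below follow the form I recall from the reference implementation, with `final_lr` as the SGD-like target rate and `gamma` controlling how fast the bounds tighten; treat the exact formulas and the default `gamma=1e-3` as assumptions rather than a specification.

```python
def adabound_bounds(step, final_lr=0.1, gamma=1e-3):
    """Lower and upper learning-rate bounds at a given step (step >= 1)."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


def clip(value, lower, upper):
    """Clamp value into [lower, upper]."""
    return max(lower, min(value, upper))


adam_step_size = 5.0  # hypothetical per-parameter step size proposed by Adam

for step in (1, 100, 10_000, 1_000_000):
    lower, upper = adabound_bounds(step)
    print(step, round(lower, 5), round(upper, 5), round(clip(adam_step_size, lower, upper), 5))
```

Early in training the interval is wide, so the Adam step size passes through almost unchanged; as the step count grows, both bounds approach `final_lr` and the update behaves like SGD with that rate.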

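1. How AdaBound differs from AdamW and SGD

AdaBound: combines the strengths of Adam and SGD. Its dynamic learning-rate bounds make it behave like SGD late in training, with the aim of overcoming the weaker final performance that Adam can show.
AdamW: a variant of Adam that decouples weight decay from the gradient update. The decay is applied directly in the parameter update instead of being added to the loss as an L2 penalty, which fixes the way Adam interacts with regularization.
SGD: stochastic gradient descent, the most basic optimization algorithm. It is usually combined with momentum and a learning-rate scheduler to speed up convergence and improve the final performance of the model.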

2. Install adabound

pip install adabound

3. Use adabound

import adabound

# model is assumed to be an nn.Module defined elsewhere
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)

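7. AutoAugment

AutoAugment is an automated data augmentation technique used in deep learning, especially in image recognition tasks. Its goal is to discover the best data augmentation policy automatically, which improves the performance and generalization of the model. Data augmentation expands the training set without changing the labels of the original data, using operations such as rotating, flipping, and scaling images, so that the model adapts better to unseen data and becomes more robust.

AutoAugment automates augmentation by searching for the best augmentation policy. It uses reinforcement learning: a model called the controller proposes augmentation policies for a given task. The controller tries different operations (such as color adjustment, shearing, and rotation), observes how they affect model performance, and learns which operations improve accuracy.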

Improves accuracy: more diverse data reduces overfitting, so the model performs better on unseen data.
Reduces manual tuning: the search for the best augmentation policy is automated, which removes the burden of choosing and tuning augmentation methods by hand.
Improves generalization: a broader representation of the data helps the model keep its performance under varied conditions, especially in real-world applications.

Excerpt from the torchvision docstring for AutoAugment (truncated in the source):

"""AutoAugment data augmentation method based on
    `"AutoAugment: Learning Augmentation Strategies from Data" <https://arxiv.org/pdf/1805.09501.pdf>`_.
    If the image is torch Tensor, it should be of type torch.uint8, and it is expected
    to have [..., 1 or 3, H, W] shape, where ... means an arbitrary number of leading dimensions.
    If img is PIL Image, it is expected to be in mode "L" or "RGB".

    Args:
        policy (AutoAugmentPolicy): Desired policy enum defined by
            :class:`torchvision.transforms.autoaugment.AutoAugmentPolicy`. Default is ``AutoAugmentPolicy.IMAGENET``.
        interpolation (InterpolationMode): Desired interpolation enum defined by
            :class:`torchvision.transforms.InterpolationMode`. Default is ``InterpolationMode.NEAREST``.
            If input is Tensor, only ``InterpolationMode.NEAREST``, ``InterpolationMode.BILINEAR`` are supported.
        fill (sequence or number, optional): Pixel fill value for the area outside the transformed

import torch
import torchvision
from torchvision import datasets, transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Define the preprocessing and augmentation pipeline.
# Pick the AutoAugmentPolicy that fits your task (IMAGENET, CIFAR10, SVHN).
transform = transforms.Compose([
    transforms.Resize((32, 32)),  # resize the image
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),  # apply AutoAugment, here with the ImageNet policy as an example
    transforms.ToTensor(),  # convert the image to a PyTorch tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # normalize with ImageNet statistics
])

# Load the dataset (CIFAR10 as an example)
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

# Create the data loader
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Define and train your model below...