Modifications to YOLOv5

news2025/7/14 2:55:55

The standard YOLOv5 network does not use an attention mechanism, but the source code does contain a brief related implementation:

class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        """
        Initializes a transformer layer, sans LayerNorm for performance, with multihead attention and linear layers.

        As described in https://arxiv.org/abs/2010.11929.
        """
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        """Performs forward pass using MultiheadAttention and two linear transformations with residual connections."""
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x
        x = self.fc2(self.fc1(x)) + x
        return x

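This implements a Transformer layer with multi-head attention and linear transformations, and uses residual connections to strengthen the model's learning capacity.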
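As a rough usage sketch (assuming PyTorch is installed), the layer above maps a tensor of shape (sequence length, batch, c) to a tensor of the same shape. That layout follows the default `batch_first=False` of `nn.MultiheadAttention`. The sizes below (400 tokens, batch of 2, 128 channels, 8 heads) are illustrative assumptions, not values from the experiments in this post.

```python
import torch
import torch.nn as nn


class TransformerLayer(nn.Module):
    """Same structure as the layer quoted above: attention plus two linear layers, no LayerNorm."""

    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        # Attention block with a residual connection
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x
        # Two linear layers with a residual connection
        x = self.fc2(self.fc1(x)) + x
        return x


# Illustrative sizes (assumptions): a 20x20 feature map flattened to 400 tokens,
# a batch of 2, 128 channels and 8 heads (128 is divisible by 8).
layer = TransformerLayer(c=128, num_heads=8)
tokens = torch.randn(400, 2, 128)  # (sequence length, batch, channels)
out = layer(tokens)
print(out.shape)  # torch.Size([400, 2, 128])
```

Because both residual additions require matching shapes, the layer can be stacked without any reshaping in between.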

We modify the code to:

import torch.nn as nn


class TransformerLayer(nn.Module):
    """Transformer layer with multihead attention and linear layers, optimized by removing LayerNorm.

    Args:
        c (int): The dimension of the input embeddings.
        num_heads (int): The number of heads in the multi-head attention module.

    Returns:
        torch.Tensor: The output tensor after transformation.
    """

    def __init__(self, c, num_heads):
        """
        Initializes a transformer layer, sans LayerNorm for performance, with multihead attention and linear layers.

        Args:
            c (int): The dimension of the input embeddings.
            num_heads (int): The number of heads in the multi-head attention module.

        Raises:
            ValueError: If `c` or `num_heads` is not a positive integer.
        """
        super().__init__()
        if not isinstance(c, int) or c <= 0:
            raise ValueError("c must be a positive integer")
        if not isinstance(num_heads, int) or num_heads <= 0:
            raise ValueError("num_heads must be a positive integer")

        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

        # Initialize weights
        nn.init.xavier_uniform_(self.q.weight)
        nn.init.xavier_uniform_(self.k.weight)
        nn.init.xavier_uniform_(self.v.weight)
        nn.init.xavier_uniform_(self.fc1.weight)
        nn.init.xavier_uniform_(self.fc2.weight)

    def forward(self, x):
        """Performs forward pass using MultiheadAttention and two linear transformations with residual connections and activation functions.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The output tensor after transformation.
        """
        try:
            attn_output, _ = self.ma(self.q(x), self.k(x), self.v(x))
            x = attn_output + x
            x = self.fc2(self.relu(self.fc1(x))) + x
            x = self.dropout(x)
            return x
        except Exception as e:
            print(f"Error during forward pass: {e}")
            raise

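The improved code has the following advantages:

Validates the constructor arguments (obviously unnecessary; it is only there to pad the length).
Applies a ReLU activation between the two linear layers, introducing nonlinearity so the model can better fit complex functional relationships.
Adds a dropout layer to reduce overfitting (although the rate could be set lower).
Applies Xavier uniform initialization to the weights of each linear layer (detailed formula), which is intended to speed up convergence.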
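For reference, Xavier (Glorot) uniform initialization draws each weight from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The following is a minimal standard-library sketch of that formula. The helper names `xavier_uniform_bound` and `xavier_uniform_matrix` and the channel count 256 are hypothetical, chosen only for illustration; the code in this post relies on `nn.init.xavier_uniform_` instead.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Return the bound a of the uniform range U(-a, a)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_matrix(fan_out, fan_in, gain=1.0, seed=0):
    """Sample a (fan_out x fan_in) weight matrix as nested lists (illustration only)."""
    rng = random.Random(seed)
    a = xavier_uniform_bound(fan_in, fan_out, gain)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


# For a square nn.Linear(c, c) layer, fan_in = fan_out = c, so a = sqrt(3 / c).
c = 256  # hypothetical channel count
bound = xavier_uniform_bound(c, c)
weights = xavier_uniform_matrix(c, c)
print(round(bound, 4))  # 0.1083
```

The bound shrinks as the layer gets wider, which keeps the variance of the layer's output roughly independent of its width.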

For the number of heads I chose 8, i.e. 8-head attention. The model contains 6 transformer layers.
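To get a feel for what 8 heads and 6 layers cost, here is a small back-of-the-envelope sketch. It assumes the layer structure used in this post and, to my understanding, the default parameter layout of `nn.MultiheadAttention` (a fused input projection and an output projection, both with bias). The channel count `c = 256` and both helper names are hypothetical.

```python
def transformer_layer_params(c):
    """Parameter count of one TransformerLayer as defined in this post."""
    qkv = 3 * c * c          # q, k, v: bias-free c x c linear layers
    mha = 4 * c * c + 4 * c  # attention: in-projection (3c x c + 3c) and out-projection (c x c + c)
    ffn = 2 * c * c          # fc1, fc2: bias-free c x c linear layers
    return qkv + mha + ffn


def head_dim(c, num_heads):
    """Per-head width; the embedding size must divide evenly among the heads."""
    if c % num_heads != 0:
        raise ValueError("c must be divisible by num_heads")
    return c // num_heads


c, num_heads, num_layers = 256, 8, 6  # c = 256 is a hypothetical channel count
print(head_dim(c, num_heads))                    # 32
print(num_layers * transformer_layer_params(c))  # 3545088
```

Note that the number of heads does not change the parameter count; it only splits the same `c` channels into narrower per-head subspaces.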

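The results of running without any modification are as follows:

If the transformer layer is simply added for training without being improved, the results are as follows:

Unfortunately, performance dropped significantly.

The training results after the improvement are as follows:

Let's try tuning the dropout (dropout=0.1 here):

The code is probably wrong, so I revised it and started over. The revised code is as follows: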

import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Transformer layer with multihead attention and linear layers, optimized by removing LayerNorm."""

    def __init__(self, c, num_heads, dropout_rate=0.1):
        """
        Initializes a transformer layer, with multihead attention and linear layers.

        Args:
            c (int): The number of input/output channels.
            num_heads (int): The number of attention heads.
            dropout_rate (float): The dropout rate for regularization.
        """
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)
        self.dropout = nn.Dropout(dropout_rate)  # Dropout layer
        self.activation = nn.GELU()  # Activation function

    def forward(self, x):
        """Performs forward pass using MultiheadAttention and two linear transformations with residual connections."""
        # Multihead attention
        attn_output = self.ma(self.q(x), self.k(x), self.v(x))[0]
        x = attn_output + x  # Residual connection
        x = self.dropout(x)  # Apply dropout

        # Feedforward network
        x = self.fc1(x)
        x = self.activation(x)  # Activation function
        x = self.fc2(x) + attn_output  # Residual connection (adds attn_output, not the post-dropout x)
        return x

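This version drops the Xavier uniform initialization and adds a GELU activation after the first linear layer to strengthen the nonlinear expressive power. A residual connection follows each major operation, which helps gradients flow. (Hopefully this works, please!!!)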
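The point about nonlinearity can be checked by hand: two bias-free linear layers with nothing in between collapse into a single linear map, while inserting GELU breaks that equivalence. The sketch below uses only the standard library. The 2x2 weights are toy values made up for illustration, and `gelu` is the erf-based form 0.5 * v * (1 + erf(v / sqrt(2))), which I believe matches the default behavior of `nn.GELU`.

```python
import math


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def gelu(v):
    """Exact (erf-based) GELU: 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


# Toy bias-free weights (hypothetical values chosen for illustration).
w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[2.0, 0.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Without an activation, fc2(fc1(x)) equals one linear map (w2 @ w1) applied to x.
stacked = matvec(w2, matvec(w1, x))
collapsed = matvec(matmul(w2, w1), x)
print(stacked, collapsed)  # [-2.0, 5.5] [-2.0, 5.5]

# With GELU in between, the result no longer matches the collapsed linear map.
with_gelu = matvec(w2, [gelu(h) for h in matvec(w1, x)])
print(with_gelu)
```

This is why the original `fc2(fc1(x))` adds parameters without adding expressive power beyond a single linear layer.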

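The results of this change are as follows: (Never mind, I'd better go back and build on the Xavier uniform initialization version instead.)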