YOLO11 Improvements | Encoder Series | Adding the AIFI Hybrid Feature Encoder


1. The AIFI Hybrid Encoder Mechanism

1.1 Introduction to the AIFI Hybrid Encoder


The paper gives no structure diagram for AIFI, so let's walk through its code to see how it runs and what its advantages are.

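Processing steps:

Position embedding:
2D sine-cosine position embedding (build_2d_sincos_position_embedding):
The AIFI module first builds a 2D sine-cosine position embedding for the input. It plays the same role as the positional encoding in a standard Transformer, supplying the position information that self-attention lacks on its own, but it is designed for 2D image structure: the embedding is computed from the width (w) and height (h) of the feature map. Each axis is encoded with sine and cosine functions over a range of frequencies, and the four parts (sin and cos for w, sin and cos for h) are concatenated along the channel dimension, so different spatial positions receive different codes.
Input feature processing:
Flatten and transpose: the input x has shape [batch_size, channels, height, width] (i.e. [B, C, H, W]). Before the Transformer layer, AIFI flattens the spatial dimensions and transposes the result to [B, HxW, C], the layout the Transformer expects. Each pixel's feature vector becomes one sequence element, and the channel count stays at C.
The position embedding has shape [1, HxW, C], matching the flattened features, and is passed into the Transformer layer together with them.
Transformer Encoder layer:
Multi-head self-attention (MultiheadAttention): the flattened features go through the MultiheadAttention layer. AIFI inherits from TransformerEncoderLayer, where the position embedding is added to the features to form the queries (q) and keys (k). The q-k similarity scores then weight the values (v), which are the features without the position embedding, to produce the new feature representation. The position embedding is what lets the attention scores depend on spatial location, which strengthens the association between positions.
Residual connection and feed-forward network: with the default post-normalization path, the attention output is added back to the input through a residual connection and normalized with LayerNorm. It then passes through a two-layer feed-forward network, followed by a second residual connection and LayerNorm. Dropout is applied on both branches as regularization against overfitting (its default rate here is 0).
Restore shape:
Restoring the dimensions: after the Transformer layer the features have shape [B, HxW, C]. AIFI reshapes them back to the original [B, C, H, W], turning the sequence back into a 2D feature map that later convolutions or other image operations can consume.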
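To make the position-embedding step concrete, here is a minimal pure-Python sketch that mirrors build_2d_sincos_position_embedding from the code in section 1.2. The function name sincos_2d and the tiny 4x4 grid with embed_dim=8 are assumptions chosen for illustration; the real module does this with torch tensors.

```python
import math


def sincos_2d(w, h, embed_dim, temperature=10000.0):
    """Hypothetical list-based mirror of build_2d_sincos_position_embedding.

    Returns w*h rows, each with embed_dim floats laid out as
    [sin(w), cos(w), sin(h), cos(h)].
    """
    assert embed_dim % 4 == 0, "embed_dim must be divisible by 4"
    pos_dim = embed_dim // 4
    # One frequency per channel in each quarter of the embedding
    omega = [1.0 / temperature ** (i / pos_dim) for i in range(pos_dim)]
    rows = []
    # meshgrid(..., indexing='ij') followed by flatten() walks the w index slowest
    for gw in range(w):
        for gh in range(h):
            out_w = [gw * o for o in omega]
            out_h = [gh * o for o in omega]
            rows.append(
                [math.sin(v) for v in out_w] + [math.cos(v) for v in out_w]
                + [math.sin(v) for v in out_h] + [math.cos(v) for v in out_h]
            )
    return rows


emb = sincos_2d(w=4, h=4, embed_dim=8)
print(len(emb), len(emb[0]))  # 16 8
print(emb[0])  # [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

# Every grid position gets a distinct vector
distinct = {tuple(round(v, 6) for v in row) for row in emb}
print(len(distinct))  # 16
```

The divisibility check explains why the channel count fed to AIFI has to be a multiple of 4: each of the four sin/cos parts takes exactly one quarter of the channels.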

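Advantages of the AIFI module:

Effective 2D position embedding:
AIFI uses a 2D sine-cosine position embedding that captures the spatial structure and position information of the image. It is simple and cheap to compute, has no learnable parameters, and gives the Transformer layer the position information that self-attention needs to learn spatial relationships. Compared with the position embedding of a standard Transformer, which mainly targets 1D sequences, the 2D embedding in AIFI is a better fit for image data.
Sequence modeling adapted to images:
By flattening the feature map into a sequence of HxW tokens, AIFI can apply a Transformer to image features, which lets the network learn long-range dependencies across positions. Self-attention gathers global information over the whole feature map, which helps the model's perception and overall performance.
Global modeling ability of the Transformer:
Multi-head self-attention lets the model build dependencies between any two positions, unlike a conventional convolution, which looks only at a local neighborhood. The Transformer layer in AIFI therefore integrates global features, which makes it well suited to image tasks with complex context or long-range dependencies.
Flexible residual connection and normalization strategy:
By inheriting from TransformerEncoderLayer, AIFI uses residual connections and LayerNorm, and the normalize_before flag selects between post-normalization and pre-normalization. This helps training stay stable and mitigates vanishing gradients. The residual connections also preserve the original input features, helping the network balance old and new features.
Efficient computation:
The code shown here needs only flatten and permute operations to move between the image layout and the sequence layout, so the reshaping itself adds almost no cost. Self-attention cost grows quadratically with the number of tokens HxW, so the module is cheapest on a low-resolution feature map such as P5/32, which is where the yaml in section 3.1 places it. Used that way, the combination of position encoding and a Transformer layer provides global feature modeling at a modest increase in computation.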
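The cost argument can be checked with a little arithmetic. The sketch below counts the tokens and the size of one attention score matrix at each pyramid level, assuming a 640x640 input (an assumption; the text does not state an image size) and the strides 8, 16 and 32 named in the yaml comments. The helper name attention_tokens is hypothetical.

```python
def attention_tokens(img_size, stride):
    """Number of sequence elements after flattening an (img_size/stride)^2 feature map."""
    side = img_size // stride
    return side * side


img_size = 640  # assumed input resolution
for name, stride in [("P3", 8), ("P4", 16), ("P5", 32)]:
    n = attention_tokens(img_size, stride)
    # n * n is the number of entries in one attention score matrix (per head, per image)
    print(name, n, n * n)
# P3 6400 40960000
# P4 1600 2560000
# P5 400 160000
```

Under these assumptions, one attention matrix at P5 is 256 times smaller than at P3, which is a plausible reason for applying AIFI only at the deepest backbone level.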

1.2 AIFI Core Code

import torch
import torch.nn as nn


class TransformerEncoderLayer(nn.Module):
    """Defines a single layer of the transformer encoder."""

    def __init__(self, c1, cm=2048, num_heads=8, dropout=0.0, act=nn.GELU(), normalize_before=False):
        """Initialize the TransformerEncoderLayer with specified parameters."""
        super().__init__()
        self.ma = nn.MultiheadAttention(c1, num_heads, dropout=dropout, batch_first=True)
        # Implementation of Feedforward model
        self.fc1 = nn.Linear(c1, cm)
        self.fc2 = nn.Linear(cm, c1)

        self.norm1 = nn.LayerNorm(c1)
        self.norm2 = nn.LayerNorm(c1)
        self.dropout = nn.Dropout(dropout)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.act = act
        self.normalize_before = normalize_before

    @staticmethod
    def with_pos_embed(tensor, pos=None):
        """Add position embeddings to the tensor if provided."""
        return tensor if pos is None else tensor + pos

    def forward_post(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        """Performs forward pass with post-normalization."""
        q = k = self.with_pos_embed(src, pos)
        src2 = self.ma(q, k, value=src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        src2 = self.fc2(self.dropout(self.act(self.fc1(src))))
        src = src + self.dropout2(src2)
        return self.norm2(src)

    def forward_pre(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        """Performs forward pass with pre-normalization."""
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.ma(q, k, value=src2, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.fc2(self.dropout(self.act(self.fc1(src2))))
        return src + self.dropout2(src2)

    def forward(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        """Forward propagates the input through the encoder module."""
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)


class AIFI(TransformerEncoderLayer):
    """Defines the AIFI transformer layer."""

    def __init__(self, c1, cm=2048, num_heads=8, dropout=0, act=nn.GELU(), normalize_before=False):
        """Initialize the AIFI instance with specified parameters."""
        super().__init__(c1, cm, num_heads, dropout, act, normalize_before)

    def forward(self, x):
        """Forward pass for the AIFI transformer layer."""
        c, h, w = x.shape[1:]
        pos_embed = self.build_2d_sincos_position_embedding(w, h, c)
        # Flatten [B, C, H, W] to [B, HxW, C]
        x = super().forward(x.flatten(2).permute(0, 2, 1), pos=pos_embed.to(device=x.device, dtype=x.dtype))
        return x.permute(0, 2, 1).view([-1, c, h, w]).contiguous()

    @staticmethod
    def build_2d_sincos_position_embedding(w, h, embed_dim=256, temperature=10000.0):
        """Builds 2D sine-cosine position embedding."""
        grid_w = torch.arange(int(w), dtype=torch.float32)
        grid_h = torch.arange(int(h), dtype=torch.float32)
        grid_w, grid_h = torch.meshgrid(grid_w, grid_h, indexing='ij')
        assert embed_dim % 4 == 0, \
            'Embed dimension must be divisible by 4 for 2D sin-cos position embedding'
        pos_dim = embed_dim // 4
        omega = torch.arange(pos_dim, dtype=torch.float32) / pos_dim
        omega = 1. / (temperature ** omega)

        out_w = grid_w.flatten()[..., None] @ omega[None]
        out_h = grid_h.flatten()[..., None] @ omega[None]

        return torch.cat([torch.sin(out_w), torch.cos(out_w), torch.sin(out_h), torch.cos(out_h)], 1)[None]
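
The two reshapes in AIFI.forward are easy to misread, so here is a small sketch of the index arithmetic for a single sample. Plain nested lists stand in for tensors so that the mapping is visible; flatten_tokens and restore_map are hypothetical helpers, not part of the module.

```python
def flatten_tokens(x):
    """[C][H][W] nested lists -> [H*W][C], like x.flatten(2).permute(0, 2, 1) for one sample."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[x[c][r][col] for c in range(C)] for r in range(H) for col in range(W)]


def restore_map(tokens, C, H, W):
    """[H*W][C] -> [C][H][W], like x.permute(0, 2, 1).view(-1, C, H, W) for one sample."""
    return [[[tokens[r * W + col][c] for col in range(W)] for r in range(H)] for c in range(C)]


C, H, W = 2, 2, 3
# Encode (channel, row, col) into each value so the origin of every entry is readable
x = [[[100 * c + 10 * r + col for col in range(W)] for r in range(H)] for c in range(C)]

tokens = flatten_tokens(x)
print(len(tokens), len(tokens[0]))  # 6 2
print(tokens[4])  # token 4 is row 1, col 1 -> [11, 111]
print(restore_map(tokens, C, H, W) == x)  # True
```

Token k holds all C channel values of the pixel at row k // W, column k % W, and the restore step is an exact inverse, so no information is lost by the round trip.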

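2. Adding the AIFI Mechanism

2.1 STEP1

First, under the ultralytics/nn directory, create a Python package named Add-module (it must be a Python package; creating one generates __init__.py automatically). If you have already created it while following one of my earlier tutorials, skip this part. Then create a file named AIFI.py and paste all of the code from section 1.2 into it, as shown in the accompanying screenshot.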

2.2 STEP2

In the __init__.py file created in STEP1, import the new module, as shown in the accompanying screenshot.

2.3 STEP3

Open tasks.py in the ultralytics/nn folder and add the code shown in the accompanying screenshot.

2.4 STEP4

In tasks.py under ultralytics/nn, locate the function def parse_model(d, ch, verbose=True): # model_dict, input_channels(3) and add the code shown in the figure (if it is hard to find, press Ctrl+F and search for it).
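
The screenshot is not reproduced in this text, so the following is only a sketch of what such a parse_model branch typically does. It assumes the branch follows the pattern `args = [ch[f], *args]`, i.e. it prepends the input channel count to the yaml arguments, which is how upstream Ultralytics treats its built-in AIFI as far as I can tell. The helpers below are local stand-ins written for this example, not the library's own functions.

```python
import math


def make_divisible(x, divisor=8):
    """Local stand-in: round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scaled_channels(c2, width, max_channels):
    """Assumed width scaling applied to ordinary conv-style layers."""
    return make_divisible(min(c2, max_channels) * width, 8)


def aifi_args(ch_in, yaml_args):
    """Assumed AIFI branch: args = [ch[f], *args]."""
    return [ch_in, *yaml_args]


# Scale 'n' from the yaml in section 3.1: [depth, width, max_channels] = [0.50, 0.25, 1024]
width, max_channels = 0.25, 1024
ch_prev = scaled_channels(1024, width, max_channels)  # output channels of layer 8 (C3k2)

# yaml entry: [-1, 1, AIFI, [1024, 8]]
c1, cm, num_heads = aifi_args(ch_prev, [1024, 8])
print(c1, cm, num_heads)  # 256 1024 8

# Constraints implied by the code in section 1.2
print(c1 % 4 == 0)          # True: required by build_2d_sincos_position_embedding
print(c1 % num_heads == 0)  # True: required by nn.MultiheadAttention
```

Under these assumptions, the two numbers in the yaml entry map to cm (the feed-forward hidden size) and num_heads, while c1 comes from the previous layer rather than from the yaml.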


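3. The yaml File and Running

3.1 The yaml file

Below is the yaml file with AIFI added to the backbone. Feel free to comment lines in or out and tune it yourself; judge the effect by the results on your own dataset.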

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, AIFI, [1024, 8]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)


  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
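
When a module is added to the backbone, the layer indices used by the head are the first thing worth checking. The sketch below copies the (from, module) pairs from the yaml above into a plain list and resolves every `from` field to an absolute layer index; resolve_sources is a hypothetical helper. In the stock yolo11.yaml, layer 9 is SPPF, so with AIFI in that slot the head indices should stay as they are.

```python
# (from, module) pairs copied from the yaml above; repeats and args are omitted
layers = [
    (-1, "Conv"), (-1, "Conv"), (-1, "C3k2"), (-1, "Conv"), (-1, "C3k2"),
    (-1, "Conv"), (-1, "C3k2"), (-1, "Conv"), (-1, "C3k2"),
    (-1, "AIFI"), (-1, "C2PSA"),                               # backbone: 0-10
    (-1, "nn.Upsample"), ([-1, 6], "Concat"), (-1, "C3k2"),    # 11-13
    (-1, "nn.Upsample"), ([-1, 4], "Concat"), (-1, "C3k2"),    # 14-16
    (-1, "Conv"), ([-1, 13], "Concat"), (-1, "C3k2"),          # 17-19
    (-1, "Conv"), ([-1, 10], "Concat"), (-1, "C3k2"),          # 20-22
    ([16, 19, 22], "Detect"),                                  # 23
]


def resolve_sources(layers):
    """Map each layer index to the absolute indices of its inputs.

    Negative values are relative to the current layer; layer 0 resolves to -1,
    which stands for the input image.
    """
    resolved = {}
    for i, (src, _name) in enumerate(layers):
        srcs = src if isinstance(src, list) else [src]
        resolved[i] = [i + s if s < 0 else s for s in srcs]
    return resolved


sources = resolve_sources(layers)
for i in (9, 21, 23):
    print(i, layers[i][1], "<-", [(j, layers[j][1]) for j in sources[i]])
# 9 AIFI <- [(8, 'C3k2')]
# 21 Concat <- [(20, 'Conv'), (10, 'C2PSA')]
# 23 Detect <- [(16, 'C3k2'), (19, 'C3k2'), (22, 'C3k2')]
```

If you move AIFI elsewhere or insert it as an extra layer instead, every absolute index after it shifts by one, and the Concat and Detect entries need to be updated to match.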