激活函数-swiGLU

news2025/7/14 14:21:18

swiGLU（Switch Gated Linear Unit）简介

swiGLU 是一种改进的激活函数模块，主要用于深度学习中的 Transformer 模型和其他神经网络架构。它在 GLU（Gated Linear Unit） 的基础上进行了修改，以提升模型的表现和训练效率。

1. 背景知识

在深度学习中，激活函数对模型的性能和训练效率有着显著的影响。最初的 GLU（Gated Linear Unit） 提出了通过将输入分成两部分来提高网络的表现：

公式： $\text{GLU}(x) = \text{Linear}(x) \otimes \sigma(\text{Linear}(x))$

其中：

$\text{Linear}(x)$ 表示线性变换。
$\sigma$ 是 Sigmoid 激活函数。
$\otimes$ 表示逐元素乘法（element-wise multiplication）。

2. swiGLU 结构

swiGLU 是对 GLU 的变体，结合了 Swish 激活函数（也称为 SiLU，Sigmoid Linear Unit）。Swish 函数表现出色，因为它具有非单调性和自门控的特性。

swiGLU 公式如下：

$\text{swiGLU}(x) = \text{Linear}(x) \otimes \text{Swish}(\text{Linear}(x))$

其中 Swish 激活函数 定义为：

$\text{Swish}(x) = x \cdot \sigma(x) = x \cdot \frac{1}{1 + e^{-x}}$

3. 工作原理

输入 (x) 经过两个并行的线性变换层：
- $x_1 = \text{Linear}_1(x)$
- $x_2 = \text{Linear}_2(x)$
将第一个线性变换 $x_1$ 与 Swish 激活函数 $\text{Swish}(x_2)$ 进行逐元素乘法：

$\text{swiGLU}(x) = x_1 \otimes \text{Swish}(x_2)$

4. 与 GLU 的区别

激活函数不同：
- GLU 使用 Sigmoid 作为门控激活函数。
- swiGLU 使用 Swish 作为激活函数。
性能提升：
- Swish 激活函数相比 Sigmoid 更具有优势，特别是在深层网络中。
- swiGLU 通过 Swish 提供更平滑的梯度，有助于更高效地训练深度神经网络。

5. 优点

提高性能：
- 在许多基准测试中，swiGLU 已被证明比 GLU、ReLU 及其他激活函数提供更好的表现。
平滑梯度：
- Swish 函数的平滑性使得反向传播的梯度更新更稳定，减轻梯度消失的问题。
计算效率：
- 尽管引入了额外的非线性激活函数，swiGLU 的计算开销相对较小，适合大型模型。

6. 应用场景

Transformer 模型：
- 在语言建模和自然语言处理任务中，如 GPT 系列和 BERT 的变体。
计算机视觉：
- 适用于视觉 Transformer（ViT）等结构。
任意深度网络：
- 适用于需要门控线性单元的任意网络。

示例代码

以下是一个使用 PyTorch 实现 swiGLU 的示例：

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SwiGLU, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.linear2 = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):
        return self.linear1(x) * F.silu(self.linear2(x))

# 示例输入
x = torch.randn(4, 128)  # Batch size 4, input dimension 128
model = SwiGLU(input_dim=128, hidden_dim=256)
output = model(x)
print(output.shape)  # 输出维度为 (4, 256)