The difference between nn.Embedding() and nn.Linear()

news2025/7/15 2:04:27

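Differences

nn.Embedding() takes integer indices (such as token IDs from a vocabulary). The index tensor may have any shape, and the layer returns the corresponding rows of its weight matrix $W$, much like a lookup table.
nn.Linear() takes vector inputs whose last dimension must equal a fixed input_dim, and returns the result of the linear transformation ( $W x + b$ ), which includes a bias term by default.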
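
As a quick illustration of the input requirements described above, the following minimal sketch feeds index tensors of two different shapes to nn.Embedding and a float batch to nn.Linear. The sizes (10 and 4) and the variable names are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

# Arbitrary sizes chosen for illustration only
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
linear = nn.Linear(in_features=10, out_features=4)

# nn.Embedding: integer indices of any shape; an embedding_dim axis is appended
ids_1d = torch.tensor([1, 5, 7])
ids_2d = torch.tensor([[1, 5], [7, 9]])
emb_1d_shape = tuple(embedding(ids_1d).shape)
emb_2d_shape = tuple(embedding(ids_2d).shape)

# nn.Linear: float input whose last dimension must equal in_features
x = torch.randn(3, 10)
lin_shape = tuple(linear(x).shape)

print(emb_1d_shape, emb_2d_shape, lin_shape)
```

The embedding output keeps the shape of the index tensor and adds one trailing axis, while the linear layer only replaces the last axis of its input.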

Run the code:

import torch
import torch.nn as nn

# Set the random seed so the results are reproducible
torch.manual_seed(0)

# Assume 5 words, each with an embedding dimension of 3
num_embeddings = 5
embedding_dim = 3

# Define the nn.Embedding layer
embedding = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=embedding_dim)

# Define the nn.Linear layer: input dimension = vocabulary size, output dimension = embedding dimension
linear = nn.Linear(in_features=num_embeddings, out_features=embedding_dim, bias=False)

# Manually set the nn.Linear weight to the transpose of the nn.Embedding weight
with torch.no_grad():
    linear.weight.copy_(embedding.weight.transpose(0, 1))

# Input token indices
indices = torch.tensor([0, 1, 2, 2, 4])

# Print the input token indices
print("Input token indices (indices, token IDs):")
print(indices)

# Look up the embedding vectors with nn.Embedding
embedding_output = embedding(indices)

# Print the weight matrix of the Embedding layer
print("\nEmbedding layer weight matrix (embedding.weight):")
print(embedding.weight)

# Print the Embedding output
print("\nEmbedding vectors from nn.Embedding (embedding_output):")
print(embedding_output)

# Convert the token indices to one-hot vectors
one_hot_input = nn.functional.one_hot(indices, num_classes=num_embeddings).float()

# Print the one-hot input
print("\nOne-hot vectors (one_hot_input):")
print(one_hot_input)

# Compute the embedding vectors with nn.Linear
linear_output = linear(one_hot_input)

# Print the weight matrix of the Linear layer
print("\nLinear layer weight matrix (linear.weight):")
print(linear.weight)

# Print the Linear output
print("\nEmbedding vectors from nn.Linear (linear_output):")
print(linear_output)

# Pass indices directly to nn.Linear.
# This only runs because len(indices) happens to equal in_features (5);
# the result is a weighted sum of the embedding rows, not a lookup.
try:
    linear_indices_output = linear(indices.float())
    print("\nOutput of passing indices directly to nn.Linear (linear_indices_output):")
    print(linear_indices_output)
except Exception as e:
    print("\nError when passing indices directly to nn.Linear:")
    print(e)

Output:

Input token indices (indices, token IDs):
tensor([0, 1, 2, 2, 4])

Embedding layer weight matrix (embedding.weight):
Parameter containing:
tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.4033,  0.8380, -0.7193],
        [-0.4033, -0.5966,  0.1820],
        [-0.8567,  1.1006, -1.0712]], requires_grad=True)

Embedding vectors from nn.Embedding (embedding_output):
tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.4033,  0.8380, -0.7193],
        [ 0.4033,  0.8380, -0.7193],
        [-0.8567,  1.1006, -1.0712]], grad_fn=<EmbeddingBackward0>)

One-hot vectors (one_hot_input):
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1.]])

Linear layer weight matrix (linear.weight):
Parameter containing:
tensor([[ 1.5410,  0.5684,  0.4033, -0.4033, -0.8567],
        [-0.2934, -1.0845,  0.8380, -0.5966,  1.1006],
        [-2.1788, -1.3986, -0.7193,  0.1820, -1.0712]], requires_grad=True)

Embedding vectors from nn.Linear (linear_output):
tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.4033,  0.8380, -0.7193],
        [ 0.4033,  0.8380, -0.7193],
        [-0.8567,  1.1006, -1.0712]], grad_fn=<MmBackward0>)

Output of passing indices directly to nn.Linear (linear_indices_output):
tensor([-2.8583,  3.8007, -6.7578], grad_fn=<SqueezeBackward4>)
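
The printed tensors can also be compared programmatically. The sketch below rebuilds the same two layers and checks two things: that the one-hot route through nn.Linear reproduces the nn.Embedding lookup, and that passing the raw indices as floats yields a weighted sum of embedding rows (each index value acts as the weight of the row at that position). The names same_lookup, direct, and weighted_sum are introduced here only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
linear = nn.Linear(in_features=5, out_features=3, bias=False)

# Share the weights: linear.weight is the transpose of embedding.weight
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

indices = torch.tensor([0, 1, 2, 2, 4])
one_hot = F.one_hot(indices, num_classes=5).float()

# 1) one-hot @ W^T selects exactly the rows that the lookup returns
same_lookup = torch.allclose(embedding(indices), linear(one_hot))

# 2) raw indices as floats: sum over positions i of indices[i] * embedding.weight[i]
direct = linear(indices.float())
weighted_sum = (indices.float().unsqueeze(1) * embedding.weight).sum(dim=0)
is_weighted_sum = torch.allclose(direct, weighted_sum, atol=1e-6)

print(same_lookup, is_weighted_sum)
```

The second check explains why the last output above has only three values: the five index values are treated as one 5-dimensional feature vector rather than as five separate lookups.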

Try it:

Change indices to torch.tensor([0, 2, 4]) and observe the output.
Change indices to torch.tensor([0, 1, 2, 2, 4, 0]) and observe the output.
Change indices to torch.tensor([0, 1, 2, 2, 5]) and observe the output.
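
One possible way to run all three variations at once is sketched below. The helper try_call is a hypothetical convenience written for this example (it is not part of PyTorch): it returns the output shape when a call succeeds and the exception class name when it fails. On CPU with recent PyTorch versions, the failures are typically an IndexError for an out-of-range index in nn.Embedding and a RuntimeError for shape mismatches in nn.Linear or invalid classes in one_hot, but treat the exact exception types as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
linear = nn.Linear(in_features=5, out_features=3, bias=False)

def try_call(fn):
    """Hypothetical helper: output shape on success, exception class name on failure."""
    try:
        return tuple(fn().shape)
    except Exception as e:
        return type(e).__name__

results = {}
for ids in ([0, 2, 4], [0, 1, 2, 2, 4, 0], [0, 1, 2, 2, 5]):
    indices = torch.tensor(ids)
    results[tuple(ids)] = {
        # Lookup: works for any length as long as every index is < num_embeddings
        "embedding": try_call(lambda: embedding(indices)),
        # one_hot: requires every index to be < num_classes
        "one_hot": try_call(lambda: F.one_hot(indices, num_classes=5)),
        # Raw indices into Linear: only runs when len(indices) == in_features
        "linear_direct": try_call(lambda: linear(indices.float())),
    }

for ids, outcome in results.items():
    print(ids, outcome)
```

The pattern to look for: the length of indices only matters when the indices are passed to nn.Linear directly, while the value of each index only matters for the lookup and the one-hot conversion.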

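Differences in code

Shape of the weight matrix

The weight matrix of Embedding has shape (num_embeddings, embedding_dim), i.e. (input dimension, output dimension), and there is no bias term.
The weight matrix of Linear has shape (output_dim, input_dim), i.e. (output dimension, input dimension), and the layer also holds a bias vector by default.

Initialization example: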

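# Embedding (simplified sketch)
def __init__(self, num_embeddings, embedding_dim):
    # One row per index: shape (num_embeddings, embedding_dim), no bias
    self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim))

# Linear (simplified sketch; PyTorch's nn.Linear uses a different initialization)
def __init__(self, input_dim, output_dim):
    # Stored as (output_dim, input_dim), plus a bias vector of length output_dim
    self.weight = nn.Parameter(torch.randn(output_dim, input_dim))
    self.bias = nn.Parameter(torch.randn(output_dim))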
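
The shapes described above can be confirmed on the real PyTorch layers. This is a small check, with the sizes 5 and 3 chosen only to match the earlier example.

```python
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
lin = nn.Linear(in_features=5, out_features=3)

# Embedding: (num_embeddings, embedding_dim), a single parameter
emb_weight_shape = tuple(emb.weight.shape)
emb_param_names = [name for name, _ in emb.named_parameters()]

# Linear: (out_features, in_features) plus a bias of length out_features
lin_weight_shape = tuple(lin.weight.shape)
lin_bias_shape = tuple(lin.bias.shape)
lin_param_names = [name for name, _ in lin.named_parameters()]

print(emb_weight_shape, emb_param_names)
print(lin_weight_shape, lin_bias_shape, lin_param_names)
```

The two weight shapes are transposes of each other, which is why the earlier code copies embedding.weight.transpose(0, 1) into linear.weight.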

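How inputs are handled

Embedding maps discrete inputs (such as token IDs) into a continuous embedding space. Put simply, it is a lookup table: the input is treated as an index, and the corresponding row of the weight matrix is returned.
Linear multiplies the input vector by the weight matrix and then adds the bias, which is a linear transformation.

Example code: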

# Embedding
def forward(self, input):
    return self.weight[input]  # return the rows of the weight matrix selected by input

# Linear
def forward(self, input):
    return torch.matmul(input, self.weight.T) + self.bias  # linear transformation
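
Putting the __init__ and forward fragments together gives two small modules. The class names TinyEmbedding and TinyLinear are made up for this sketch, and the random-normal initialization is a simplification rather than what PyTorch does internally. The comparison against torch.nn.functional.embedding and torch.nn.functional.linear suggests that the forward logic matches the built-in behavior for these inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEmbedding(nn.Module):
    """Illustrative lookup table: forward indexes rows of the weight matrix."""
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, input):
        return self.weight[input]

class TinyLinear(nn.Module):
    """Illustrative linear layer: forward computes input @ W^T + b."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(output_dim, input_dim))
        self.bias = nn.Parameter(torch.randn(output_dim))

    def forward(self, input):
        return torch.matmul(input, self.weight.T) + self.bias

emb = TinyEmbedding(5, 3)
lin = TinyLinear(5, 3)

ids = torch.tensor([[0, 2], [4, 1]])  # integer indices, any shape
x = torch.randn(4, 5)                 # float batch, last dim == input_dim

emb_matches = torch.equal(emb(ids), F.embedding(ids, emb.weight))
lin_matches = torch.allclose(lin(x), F.linear(x, lin.weight, lin.bias), atol=1e-6)
print(emb_matches, lin_matches)
```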

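Mathematical formulation

Embedding

Let the vocabulary size be $V$ and the embedding dimension be $D$, so the embedding layer is a matrix $E \in \mathbb{R}^{V \times D}$ . For an input sequence of token IDs $x_1, x_2, \dots, x_n$ , the layer outputs the corresponding embedding vectors $E_{x_1}, E_{x_2}, \dots, E_{x_n}$ , where each $E_{x_i} \in \mathbb{R}^{D}$ . This can be written as:
$E(x_i) = E_{x_i}$

$E$ : the embedding matrix; in code, this is the weight matrix $W$ .
$x_i$ : the input token ID, used as an index.
$E(x_i)$ : the embedding vector, i.e. the row of $E$ at that index.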

Linear

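Given an input vector $x \in \mathbb{R}^{n}$ and an output vector $y \in \mathbb{R}^{m}$ , the linear layer computes:
$y = W x + b$

$W$ : the weight matrix, of dimension $m \times n$ , which defines the mapping from the input space to the output space.
$x$ : the input vector, of dimension $n$ .
$b$ : the bias term, of dimension $m$ .
$y$ : the output vector, of dimension $m$ .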

Example

If the input vector $\mathbf{x}$ has 3 features and the output vector $\mathbf{y}$ has 2 features, the weight matrix $\mathbf{W}$ has shape $2 \times 3$ . Suppose:

$W = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$

The linear transformation is computed as:

$y = W \cdot x + b = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 14 \\ 32 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 14 \\ 33 \end{bmatrix}$

The matrix multiplication step by step:
$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} (1 \times 1) + (2 \times 2) + (3 \times 3) \\ (4 \times 1) + (5 \times 2) + (6 \times 3) \end{bmatrix} = \begin{bmatrix} 14 \\ 32 \end{bmatrix}$
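
The same numbers can be reproduced in PyTorch. The sketch below computes the result once with plain tensor operations and once through an nn.Linear layer whose weight and bias are overwritten with the values from the example. Copying values into the layer this way is only for demonstration.

```python
import torch
import torch.nn as nn

W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (2, 3) = (output dim, input dim)
x = torch.tensor([1.0, 2.0, 3.0])     # 3 input features
b = torch.tensor([0.0, 1.0])          # 2 output features

# Direct computation of y = W x + b
y_manual = W @ x + b

# The same computation through nn.Linear with the weights set by hand
linear = nn.Linear(in_features=3, out_features=2)
with torch.no_grad():
    linear.weight.copy_(W)
    linear.bias.copy_(b)
y_layer = linear(x)

print(y_manual)  # tensor([14., 33.])
print(torch.allclose(y_manual, y_layer))
```

Note that nn.Linear stores the weight as (output dimension, input dimension), so the $2 \times 3$ matrix from the example can be copied in without transposing.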
Further reading: