Deep Learning: RNN (Recurrent Neural Network)

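Latent-variable autoregressive model
Recurrent neural network
Perplexity
Gradient clipping
Implementing an RNN from scratch
- Initializing the model parameters
- Initializing the hidden state
- Wrapping the functions in a class
- - Full code for this part
- Defining the prediction function
- - Full code for this part
- Gradient clipping
- Training the model for one epoch
- Training function
Concise implementation of an RNN
- Loading the dataset
- Defining the model
- Defining the RNNModel class
- Training and prediction
- - Full code for this part
- Training with the high-level API
- Full code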

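The first neural network for sequence models is the recurrent neural network (RNN).

Latent-variable autoregressive model

A latent variable $h_t$ summarizes the information from the past.

Explanation: $x_t$ depends on the current $h_t$ and on $x_{t-1}$.
$h_t$ depends on $x_{t-1}$ and $h_{t-1}$.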

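Recurrent neural network

Explanation: $o_t$ is computed from $h_t$, and $h_t$ cannot use anything from $x_t$; it uses $x_{t-1}$.

Explanation: suppose the observation is "你". It updates the hidden variable, which then predicts "好" (in the $o_t$ row). Next, "好" is observed (in the $x_t$ row), which updates the next hidden variable, which in turn outputs the comma that follows, and so on.

$o_t$ is meant to match the input $x_t$, but $x_t$ is not visible when $o_t$ is generated. In other words, the output at the current step predicts the current observation, yet it is produced before that observation arrives. ($o_t$ is computed from $h_t$, and $h_t$ only uses $x_{t-1}$; the loss is then computed by comparing $o_t$ with $x_t$. Afterwards, $x_t$ is used to update the hidden state and move it to the next step.)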
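The timing described above can be made concrete with a small sketch. It pairs each observed token with the token that the output at the next step is scored against, using the three tokens from the example ("你", "好" and the comma). The helper name `make_pairs` is made up for this illustration and is not part of the original code.

```python
def make_pairs(tokens):
    """Pair each observed token with the token the next output is scored against."""
    inputs = tokens[:-1]   # what has been observed when the output is produced
    targets = tokens[1:]   # the observation the output is compared with in the loss
    return list(zip(inputs, targets))


pairs = make_pairs(["你", "好", "，"])
for seen, expected in pairs:
    print(f"after observing {seen!r}, the output is scored against {expected!r}")
```

The model never sees a target before producing the output for it; the target only enters afterwards, as the input of the following step.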

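$W_{hh} \in \mathbb{R}^{h \times h}$ describes how the hidden variable of the previous time step is used in the current time step. (The simplest RNN stores temporal information through $W_{hh}$; this is what distinguishes it from an MLP.)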
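A minimal sketch of the role of $W_{hh}$, written in plain Python so that every multiplication is visible. It uses the same recurrence as the `rnn` function later in this post, tanh(x W_xh + h W_hh + b_h). The sizes (3 input features, 2 hidden units) and all numbers are assumptions chosen only for illustration.

```python
import math


def matvec(v, W):
    """Row vector v (length n) times matrix W (n x m), returned as a list of length m."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h = tanh(x W_xh + h_prev W_hh + b_h)."""
    from_input = matvec(x, W_xh)
    from_history = matvec(h_prev, W_hh)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_history, b_h)]


# Illustrative parameters: 3 input features (one-hot), 2 hidden units.
W_xh = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4]]
W_hh = [[0.9, 0.0], [0.0, 0.9]]
b_h = [0.0, 0.0]

x = [1.0, 0.0, 0.0]                          # the same input is fed at both steps
h1 = rnn_step(x, [0.0, 0.0], W_xh, W_hh, b_h)
h2 = rnn_step(x, h1, W_xh, W_hh, b_h)        # differs from h1 only because of W_hh

# With W_hh set to zero the history is ignored and the layer behaves like an MLP.
no_memory = [[0.0, 0.0], [0.0, 0.0]]
h2_no_memory = rnn_step(x, h1, W_xh, no_memory, b_h)
print(h1, h2, h2_no_memory)
```

Feeding the same input twice gives two different hidden states as long as $W_{hh}$ is non-zero; zeroing it makes the second state identical to the first.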

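Perplexity

The quality of a language model can be measured with the average cross-entropy over the $n$ tokens of a sequence:

$$\pi = \frac{1}{n} \sum_{t=1}^{n} -\log p(x_t \mid x_{t-1}, \ldots, x_1)$$

Perplexity is $\exp(\pi)$: a value of 1 means a perfect model, and larger values are worse.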
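As a rough illustration, perplexity can be computed as the exponential of the average cross-entropy, which is also what the training code later in this post returns. The function name `perplexity` and the probabilities below are assumptions made for the example; the input is the probability the model assigned to each true token.

```python
import math


def perplexity(target_probs):
    """Exponential of the average negative log-probability of the true tokens."""
    avg_nll = sum(-math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_nll)


perfect = perplexity([1.0, 1.0, 1.0])     # the model is always certain and right
uniform = perplexity([1 / 28] * 5)        # uniform guessing over 28 tokens
print(perfect, uniform)
```

A perfect model reaches a perplexity of 1, and uniform guessing over a vocabulary of 28 characters gives a perplexity of about 28, so the value can be read as the effective number of equally likely choices per token.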

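Gradient clipping

If the norm of the gradient $\mathbf{g}$ exceeds a threshold $\theta$, the gradient is rescaled so that its norm equals $\theta$:

$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}$$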
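The idea can be sketched without any framework: compute one L2 norm over all gradients together and, if it exceeds the threshold, scale every gradient by the same factor. The function `clip_gradients` and the numbers are assumptions for illustration; the PyTorch version used in this post is `grad_clipping`, shown further down.

```python
import math


def clip_gradients(grads, theta):
    """Rescale all gradients so that their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > theta:
        scale = theta / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm


grads = [[3.0, 0.0], [0.0, 4.0]]          # joint norm is 5
clipped, norm_before = clip_gradients(grads, theta=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after, clipped)
```

Because every entry is multiplied by the same factor, the direction of the update is preserved and only its length changes. Gradients whose norm is already below the threshold are left untouched.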

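Implementing an RNN from scratch

import torch
from d2l import torch as d2l
from torch.nn import functional as F

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# torch.tensor([0, 2]) is a vector of length 2; one-hot encoding appends a dimension
# of size len(vocab), so the result has shape (2, len(vocab))
# Index 0 sets position 0 of the first row to 1; index 2 sets position 2 of the second row to 1
print(F.one_hot(torch.tensor([0, 2]), len(vocab)))

A minibatch has shape (batch size, number of time steps).

# Build an X with batch size 2 and 5 time steps
X = torch.arange(10).reshape((2, 5))
# Transposing swaps the first two dimensions, so after one-hot encoding dimension 0 is the
# time step, dimension 1 is the batch size, and dimension 2 is the feature length
# Indexing by time step t then yields X_t with shape (batch size, vocabulary size)
print(F.one_hot(X.T, 28).shape)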
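To see what the transposed one-hot encoding does to the shape, here is a plain-Python sketch that mirrors `F.one_hot(X.T, 28)` with nested lists. The helpers `one_hot` and `time_major_one_hot` are made up for this example and are not part of PyTorch or d2l.

```python
def one_hot(index, size):
    """Vector of length size with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def time_major_one_hot(batch, vocab_size):
    """(batch, steps) nested list of indices -> (steps, batch, vocab_size) nested list."""
    steps = len(batch[0])
    return [[one_hot(seq[t], vocab_size) for seq in batch] for t in range(steps)]


X = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]    # same values as torch.arange(10).reshape((2, 5))
encoded = time_major_one_hot(X, 28)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)
```

Putting the time step first means that a simple `for X_t in encoded` loop visits one time step at a time, each holding the inputs of every sequence in the batch.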

Initializing the model parameters

# Initialize the parameters of the RNN model
def get_params(vocab_size, num_hiddens, device):
    # Inputs and outputs both have the vocabulary size as their dimension
    num_inputs = num_outputs = vocab_size

    # Draw a tensor from a standard normal distribution and scale it by 0.01
    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # Input-to-hidden weights, shape (vocabulary size, number of hidden units)
    W_xh = normal((num_inputs, num_hiddens))
    # Hidden-to-hidden weights, shape (number of hidden units, number of hidden units)
    W_hh = normal((num_hiddens, num_hiddens))
    # Hidden-layer bias, shape (number of hidden units,)
    b_h = torch.zeros(num_hiddens, device=device)
    # Hidden-to-output weights, shape (number of hidden units, vocabulary size)
    W_hq = normal((num_hiddens, num_outputs))
    # Output-layer bias, shape (vocabulary size,)
    b_q = torch.zeros(num_outputs, device=device)
    # Collect all parameters in a list
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        # Enable gradient tracking for every parameter
        param.requires_grad_(True)
    return params

Initializing the hidden state

The init_rnn_state function returns the hidden state used at initialization.

# Provide an initial hidden state for time step 0, when no previous state exists
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),)

The rnn function defines how the hidden state and the output are computed at each time step, looping over all time steps of the input.

# inputs holds all time steps X_0, X_1, ..., one slice per time step
def rnn(inputs, state, params):
    # Unpack the input-to-hidden weights W_xh, the hidden-to-hidden weights W_hh,
    # the hidden bias b_h, the hidden-to-output weights W_hq and the output bias b_q
    W_xh, W_hh, b_h, W_hq, b_q = params
    # state is a one-element tuple; the trailing comma unpacks it,
    # so H is the hidden-state tensor itself
    H, = state
    # List that collects the output of every time step
    outputs = []

    # Iterate over dimension 0; inputs has shape (number of time steps, batch size, vocabulary size)
    for X in inputs:
        # X is the slice for one time step, with shape (batch size, vocabulary size)
        # Compute the new hidden state from the current input X and the previous
        # hidden state H, using tanh as the activation function
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        # Y is the prediction made at this time step: which token comes next
        Y = torch.mm(H, W_hq) + b_q
        # Store the output of this time step
        outputs.append(Y)

    # Each Y has shape (batch size, vocabulary size), one score vector per sample.
    # Concatenating the T outputs along dimension 0 gives shape (T * batch size, vocabulary size).
    # Return the concatenated outputs and the hidden state of the last time step
    return torch.cat(outputs, dim=0), (H,)

A note on the cat method

import torch

# First batch
batch1 = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(batch1)
# Second batch
batch2 = torch.tensor([[7, 8, 9], [10, 11, 12]])
print(batch2)
# Concatenate the two batches along dimension 0 (the batch dimension); the result has shape (4, 3)
combined_batch = torch.cat([batch1, batch2], dim=0)
print(combined_batch)

Wrapping the functions in a class

# A class that wraps the functions above
class RNNModelScratch:
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        # Store the vocabulary size and the number of hidden units
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        # Initialize the model parameters with get_params:
        # W_xh, W_hh, b_h, W_hq and b_q
        self.params = get_params(vocab_size, num_hiddens, device)
        # Store the state-initialization function and the forward function
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        # One-hot encode the input X into shape (number of time steps, batch size, vocabulary size)
        # and convert it to float32
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        # Run the forward function and return its result
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        # Return the initial hidden state for the first time step
        return self.init_state(batch_size, self.num_hiddens, device)

When predict_ch8 is called, it builds its inputs with get_input, each a (1, 1) tensor holding a token index, and passes them to net (an instance of RNNModelScratch). This invokes the __call__ method of net with X and the current hidden state state as arguments; the one-hot encoding happens inside __call__.

Input data X  
  ↓  
  ↓  
One-hot encoding (vocab_size=28)  
  ↓  
  ↓  
Conversion to float32  
  ↓  
Forward function self.forward_fn  
  ↓  
Output

Full code for this part

import random
import torch
from d2l import torch as d2l
from torch.nn import functional as F


def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params


def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),)


def rnn(inputs, state, params):
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []

    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)


class RNNModelScratch:
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
X = torch.arange(10).reshape((2, 5))
num_hiddens = 512
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                      init_rnn_state, rnn)
state = net.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = net(X.to(d2l.try_gpu()), state)
# new_state[0] has shape (batch size, number of hidden units)
print(Y.shape, len(new_state), new_state[0].shape)

Defining the prediction function

def predict_ch8(prefix, num_preds, net, vocab, device):
    """Generate new characters after 'prefix'."""
    # Initial hidden state with batch size 1 (one string is predicted) on the given device
    state = net.begin_state(batch_size=1, device=device)
    # Start the output list with the index of the first prefix character
    outputs = [vocab[prefix[0]]]
    # get_input turns the most recent index in outputs into a tensor of shape (1, 1)
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape(1, 1)
    # For every prefix character y after the first one
    for y in prefix[1:]:
        # Run a forward step to update the hidden state; the output is discarded
        _, state = net(get_input(), state)
        # Append the index of character y
        outputs.append(vocab[y])
    # Generate num_preds new characters
    for _ in range(num_preds):
        # Run a forward step to get the output and the updated hidden state
        y, state = net(get_input(), state)
        # Append the index of the highest-scoring character
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    # Map the indices back to characters and join them into a string
    return ''.join([vocab.idx_to_token[i] for i in outputs])

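Code walkthrough:

    for y in prefix[1:]:
        # The first call feeds prefix[0] and its hidden state into net. The output is ignored; the call only warms up the state.
        # Given the prefix "你好", the prediction of "好" from "你" does not need to be stored, because the correct answer is already known.
        # The only purpose is to fold the information in the prefix into state.
        _, state = net(get_input(), state)
        # Append the index of character y. outputs stores the true prefix rather than the predictions, so errors do not accumulate
        outputs.append(vocab[y])

Once the whole prefix has been absorbed into state, the actual prediction starts and runs num_preds times.

    for _ in range(num_preds):
        # Feed the previous prediction back in as the input, update state, and obtain the output y
        y, state = net(get_input(), state)
        # y has shape (batch size, number of classes); argmax(dim=1) finds the index of the
        # largest value along the second (class) dimension
        # Append the index of the highest-scoring character to outputs
        outputs.append(int(y.argmax(dim=1).reshape(1)))

End of walkthrough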
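The two phases can be reproduced with a toy stand-in for the network. In this sketch, `toy_net`, `NEXT` and `VOCAB` are hypothetical: the "model" is a fixed next-character table and its state merely counts how many tokens it has processed. Only the control flow mirrors predict_ch8.

```python
# Hypothetical stand-in for a trained network: a fixed next-character table.
NEXT = {"a": "b", "b": "c", "c": "a"}
VOCAB = ["a", "b", "c"]


def toy_net(token, state):
    """Return (scores over VOCAB, new state); the state counts processed tokens."""
    scores = [1.0 if ch == NEXT[token] else 0.0 for ch in VOCAB]
    return scores, state + 1


def generate(prefix, num_preds):
    state = 0
    outputs = [prefix[0]]
    # Warm-up: feed the known prefix, discard the predictions, keep the state.
    for ch in prefix[1:]:
        _, state = toy_net(outputs[-1], state)
        outputs.append(ch)              # ground truth, not the model's guess
    # Generation: feed the previous prediction back in as the next input.
    for _ in range(num_preds):
        scores, state = toy_net(outputs[-1], state)
        outputs.append(VOCAB[scores.index(max(scores))])   # greedy argmax
    return "".join(outputs), state


text, steps = generate("ab", 4)
print(text, steps)
```

With the prefix "ac" the toy model would have guessed "b" after "a", yet the output still contains "c": during warm-up the true prefix always wins, and only the state carries over into generation.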

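Use the prediction function to generate new characters after the prefix

# Generate 10 new characters after the prefix 'time traveller '
# Note: the model has not been trained yet, so the prediction comes from randomly initialized parameters
print(predict_ch8('time traveller ', 10, net, vocab, d2l.try_gpu()))

Full code for this part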

import random
import torch
from d2l import torch as d2l
from torch.nn import functional as F

# e.g. vocab_size=28, num_hiddens=512
def get_params(vocab_size, num_hiddens, device):
    # Inputs and outputs both have size vocab_size (28 here)
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    W_xh = normal((num_inputs, num_hiddens)) # (vocab_size, num_hiddens)➡(28,512)
    W_hh = normal((num_hiddens, num_hiddens)) # (num_hiddens, num_hiddens)➡(512,512)
    b_h = torch.zeros(num_hiddens, device=device) # (num_hiddens,)➡(512,)
    W_hq = normal((num_hiddens, num_outputs)) # (num_hiddens, vocab_size)➡(512, 28)
    b_q = torch.zeros(num_outputs, device=device) # (vocab_size,)➡(28,)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params


# e.g. batch_size=32, num_hiddens=512
def init_rnn_state(batch_size, num_hiddens, device):
    # Zero matrix of shape (batch_size, num_hiddens), e.g. 32 rows and 512 columns
    return (torch.zeros((batch_size, num_hiddens), device=device),)


def rnn(inputs, state, params):
    W_xh, W_hh, b_h, W_hq, b_q = params
    # H has shape (batch_size, num_hiddens), e.g. (32,512) for a batch of 32
    H, = state
    outputs = []
    # X has shape (batch_size, vocab_size), e.g. (32,28)
    for X in inputs:
        # (32,28)×(28,512)=(32,512) and (32,512)×(512,512)=(32,512); b_h with shape (512,) is broadcast in the sum
        # Broadcasting expands b_h to (batch_size, num_hiddens), so it is added to every row of the 2-D tensor
        # Therefore H has shape (batch_size, num_hiddens)➡(32,512)
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        # (32,512)×(512, 28)=(32,28), plus b_q with shape (28,)
        # Y has shape (batch_size, vocab_size)➡(32,28)
        Y = torch.mm(H, W_hq) + b_q
        # ⭐ outputs is a list, not a tensor, but every element has shape (batch_size, vocab_size)
        outputs.append(Y)
    # Concatenate in input order; the result has shape (num_steps * batch_size, vocab_size)
    return torch.cat(outputs, dim=0), (H,)


class RNNModelScratch:
    # Initialize the model parameters
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)


def predict_ch8(prefix, num_preds, net, vocab, device):
    """Generate new characters after 'prefix'."""
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    # Prediction handles one character at a time: outputs[-1] is the most recent token index,
    # wrapped in a tensor and reshaped to (1, 1), i.e. (batch size, number of time steps)
    # The lambda creates an anonymous function that takes no arguments
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape(1, 1)

    for y in prefix[1:]:
        # state[0] has shape (1, num_hiddens)➡(1, 512)
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    # Predict num_preds characters (10 here)
    for _ in range(num_preds):
        # y has shape (1, vocab_size)➡(1,28)
        y, state = net(get_input(), state)
        # outputs stores the indices of the whole generated sequence; y.argmax(dim=1) returns
        # the index of the highest-scoring token (a character here) for each sample
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])


batch_size, num_steps = 32, 35
num_hiddens = 512
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)
print(predict_ch8('time traveller ', 10, net, vocab, d2l.try_gpu()))

Gradient clipping

import torch
from torch import nn

def grad_clipping(net, theta):
    """Clip the gradient."""
    # If net is an nn.Module (a model built with PyTorch)
    if isinstance(net, nn.Module):
        # Collect all parameters that require gradients
        params = [p for p in net.parameters() if p.requires_grad]
    # If net is a custom model (such as RNNModelScratch above)
    else:
        # Use the parameter list of the custom model
        params = net.params
    # Gradient norm: square root of the sum of squared gradients over all parameters
    norm = torch.sqrt(sum(torch.sum((p.grad**2)) for p in params))
    # If the norm exceeds the threshold theta
    if norm > theta:
        for param in params:
            # Rescale the gradient in place so that the overall norm equals theta
            param.grad[:] *= theta / norm

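Training the model for one epoch

import math
import torch
from torch import nn
from d2l import torch as d2l

#@save
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train the network for one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, number of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize state on the first iteration or when random sampling is used
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # For nn.GRU, state is a single tensor
                state.detach_()
                # After a new minibatch is read, detach the hidden state from the computation graph
                # so that gradients do not flow into earlier minibatches, which saves compute and memory
            else:
                # For nn.LSTM and for the from-scratch model, state is a tuple of tensors; detach each one
                for s in state:
                    s.detach_()
        # Y has shape (batch size, number of time steps); transpose and flatten it into a vector
        # of length (number of time steps * batch size), matching the row order of y_hat
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)  # Forward pass
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            # Clip the gradient before updating the parameters
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # The loss has already been averaged with mean(), so batch_size=1 is passed
            updater(batch_size=1)
        metric.add(l * y.numel(), y.numel())
    # Return the perplexity and the speed (tokens per second)
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()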
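The transpose in `y = Y.T.reshape(-1)` matters because the from-scratch `rnn` concatenates its per-step outputs, so the rows of y_hat are ordered by time step first and by sequence second. The following plain-Python sketch uses made-up label values and hypothetical helpers (`transpose`, `flatten`) to show that only the transposed flattening lines up with that row order.

```python
def transpose(rows):
    """Swap the two dimensions of a nested list."""
    return [list(col) for col in zip(*rows)]


def flatten(rows):
    """Concatenate the rows of a nested list into one flat list."""
    return [v for row in rows for v in row]


# Labels Y with shape (batch size 2, 3 time steps); the values are made-up token indices.
Y = [[10, 11, 12],
     [20, 21, 22]]
batch_size, num_steps = len(Y), len(Y[0])

# Row k of y_hat belongs to time step t and sequence b, in this order:
y_hat_rows = [(t, b) for t in range(num_steps) for b in range(batch_size)]

y_time_major = flatten(transpose(Y))   # mirrors Y.T.reshape(-1)
y_batch_major = flatten(Y)             # what Y.reshape(-1) would give instead
print(y_time_major, y_batch_major)
```

Flattening without the transpose would pair most predictions with the label of a different time step or sequence, and the loss would be computed against the wrong targets.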

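Training function

The training function for the RNN model supports both the from-scratch implementation and the implementation based on the high-level API.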

import math
import torch
from torch import nn
from d2l import torch as d2l
from torch.nn import functional as F


# get_params, init_rnn_state, rnn, RNNModelScratch and predict_ch8 are unchanged from the listing above


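def grad_clipping(net, theta):
    """Clip the gradient."""
    if isinstance(net, nn.Module):  # The model is an nn.Module
        # Collect its parameters that require gradients
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        # Use the parameter list of the custom model
        params = net.params
    # Sum the squared gradients of all parameters and take the square root (equivalent to
    # concatenating the gradients of all layers into one long vector and taking its norm)
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    # If the norm exceeds the threshold theta, multiply every gradient by theta / norm
    if norm > theta:
        for param in params:
            # Rescale the gradient in place so that the overall norm equals theta
            param.grad[:] *= theta / norm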


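# @save
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train the network for one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, number of tokens
    for X, Y in train_iter:
        # Initialize state
        if state is None or use_random_iter:
            # Initialize state on the first iteration or when random sampling is used
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # For nn.GRU, state is a single tensor
                # Detach the hidden state from the graph: its values stay the same,
                # but backward() no longer reaches the earlier computation
                state.detach_()
                # After a new minibatch is read, detaching avoids unnecessary gradient
                # computation, which saves compute and memory
            else:
                # For nn.LSTM and for the from-scratch model, state is a tuple of tensors; detach each one
                for s in state:
                    s.detach_()
        # Transpose Y from (batch size, number of time steps) to (number of time steps, batch size),
        # then flatten it to length (number of time steps * batch size)
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        # Forward pass with the input sequence and the hidden state: predictions and updated state
        y_hat, state = net(X, state)
        # y_hat has shape (number of time steps * batch size, vocabulary size)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            # Clear the gradients held by the optimizer
            updater.zero_grad()
            l.backward()
            # Clip the gradient before updating the parameters
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # Run the custom update function; the loss is already averaged, so batch_size=1
            updater(batch_size=1)
        # Accumulate the loss and the number of tokens
        metric.add(l * y.numel(), y.numel())
    # Return the perplexity (exponential of the average loss, base e) and the speed (tokens per second)
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()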


# @save
def train_ch8(net, train_iter, vocab, lr, num_epochs, device, use_random_iter=False):
    """Train the model (defined in Chapter 8)."""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Set up the updater
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        # Call d2l.sgd with the given batch size to update the parameters of the custom model
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    # Prediction helper that generates 50 new characters after a given prefix
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter)
        # Every 10 epochs, generate a sample
        if (epoch + 1) % 10 == 0:
            # Print the text generated after the prefix 'time traveller'
            print(predict('time traveller'))
            # Add the perplexity of the current epoch to the plot
            animator.add(epoch + 1, [ppl])
    # Print the final perplexity and the speed in tokens per second
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))


batch_size, num_steps = 32, 35
num_hiddens = 512
num_epochs, lr = 500, 1
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
d2l.plt.show()
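The training code calls `updater(batch_size=1)` for the from-scratch model because the loss has already been averaged. The sketch below shows why on a single scalar parameter. It assumes, as the batch_size argument suggests, that the update divides the gradient by batch_size; `sgd_step` and the gradient values are made up for illustration and are not the d2l implementation.

```python
def sgd_step(param, grad, lr, batch_size):
    """Plain SGD on a scalar parameter; the gradient is divided by batch_size."""
    return param - lr * grad / batch_size


per_example_grads = [0.2, 0.4, 0.6, 0.8]
n = len(per_example_grads)

# Option 1: gradient of the summed loss, averaged inside the update.
from_sum = sgd_step(1.0, sum(per_example_grads), lr=0.5, batch_size=n)
# Option 2: gradient of the mean loss, so the update must not divide again.
from_mean = sgd_step(1.0, sum(per_example_grads) / n, lr=0.5, batch_size=1)
print(from_sum, from_mean)
```

Both options produce the same step. Passing the real batch size together with an averaged loss would divide twice and shrink the effective learning rate by a factor of the batch size.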

Predictions during training (50 lines in total):

time traveller the the the the the the the the the the the the t
time traveller the the the the the the the the the the the the t
time traveller the the the the the the the the the the the the t
time traveller and the the the the the the the the the the the t
time travellere and and and and and and and and and and and and 
time traveller and the the the the the the the the the the the t
time traveller and the the this the the the the the the the the 
time traveller and the the the the the the the the the the the t
time traveller and the the the the the the the the the the the t
time traveller and the the the the the the the the the the the t
time traveller and and the the that and and the the that and and
time traveller the this the this the this the this the this the 
time traveller and and and and and and and and and and and and a
time travellere athere are the that the enothe siont of the that
time travellere at fire wer cal meand the the thas ghas the and 
time traveller the andimensions at meathe that athe this the gra
time traveller the cines of the onge to the other this thes toun
time traveller a sexplane coners and mur all cand the pryche sim
time traveller pathere ic to sent a four wiokne some time travel
time traveller of the rime traveller of the ravellly bout in the
time traveller pat lasse thing in y i that loss the geime bsare 
time traveller than then thime time as ie mint and the thene thi
time travellerit noupse wo he save tous in time bly lempanced an
time travelleris ne iniend for mome time travelleris fourd chere
time traveller oul chand man losmedtand wishou urank that by con
time traveller cofce soime sist allithe timere abshree begrented
time traveller for this that spase time ar ar were attravexattir
time travellericknely i and in way a sat in so way urofit syis f
time traveller but now you begin to seethe object of my investig
time traveller proceeded any thatwer a comuraus that very yount 
time traveller follsoie thisnd so sinitarnt fofelyithan ubrict o
time traveller smiled aroug the notmare sorecurastilit fore ins 
time traveller for so it will be convenient to speak of himwas e
time traveller so d ascome roos move follighe that upen smave at
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller fron in counslon mo wables that flashed andpassed
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time travelleryou can show black is white by argument said filby
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time traveller for so it will be convenient to speak of himwas e
time travelleryou can show black is white by argument said filby
time travelleryou can show black is white by argument said filby
time traveller with a slight accession ofcheerfulness really thi

Results with the random sampling method.

import random
import math
import torch
from torch import nn
from d2l import torch as d2l
from torch.nn import functional as F


# 28,512
def get_params(vocab_size, num_hiddens, device):
    # 输入和输出都是28
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    W_xh = normal((num_inputs, num_hiddens))  # (vocab_size,num_hiddens)➡(28,512)
    W_hh = normal((num_hiddens, num_hiddens))  # (num_hiddens, num_hiddens)➡(512,512)
    b_h = torch.zeros(num_hiddens, device=device)  # (num_hiddens,)➡(512,)
    W_hq = normal((num_hiddens, num_outputs))  # (num_hiddens, vocab_size)➡(512, 28)
    b_q = torch.zeros(num_outputs, device=device)  # (vocab_size,)➡(28,)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params


# 32，512
def init_rnn_state(batch_size, num_hiddens, device):
    # 生成一个32行512列的零矩阵
    return (torch.zeros((batch_size, num_hiddens), device=device),)


def rnn(inputs, state, params):
    W_xh, W_hh, b_h, W_hq, b_q = params
    # H has shape (batch_size, num_hiddens) ➡ (32, 512)
    H, = state
    outputs = []
    # inputs has shape (num_steps, batch_size, vocab_size); each X has shape (batch_size, vocab_size) ➡ (32, 28)
    for X in inputs:
        # (32, 28) × (28, 512) = (32, 512), plus (32, 512) × (512, 512) = (32, 512), plus b_h of shape (512,)
        # b_h is broadcast to (batch_size, num_hiddens), so it is added to every row of the 2-D tensor.
        # Hence H keeps the shape (batch_size, num_hiddens) ➡ (32, 512)
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        # (32, 512) × (512, 28) = (32, 28), plus b_q of shape (28,)
        # Y has shape (batch_size, vocab_size) ➡ (32, 28)
        Y = torch.mm(H, W_hq) + b_q
        # ⭐ outputs is a list, not a tensor; each element has shape (batch_size, vocab_size)
        outputs.append(Y)
    # Concatenate in input order; the result has shape (num_steps * batch_size, vocab_size).
    # During prediction both num_steps and batch_size are 1, so the shape is (1, 28).
    return torch.cat(outputs, dim=0), (H,)


class RNNModelScratch:
    # Initialize the model parameters
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)


def predict_ch8(prefix, num_preds, net, vocab, device):
    """Generate new characters following `prefix`."""
    # Prediction handles a single sequence, so the batch size is 1
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    # Feed one character at a time: outputs[-1] is the index of the most recent token,
    # reshaped to (1, 1), i.e. (batch size, number of time steps).
    # The lambda defines an anonymous function that takes no arguments.
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape(1, 1)

    for y in prefix[1:]:
        # state holds a tensor of shape (1, num_hiddens) ➡ (1, 512)
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    # Predict num_preds characters
    for _ in range(num_preds):
        # y has shape (1, vocab_size) ➡ (1, 28)
        y, state = net(get_input(), state)
        # outputs stores the indices of the whole generated sequence; y.argmax(dim=1) returns the index of the highest-scoring token (a character here).
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])


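def grad_clipping(net, theta):
    """Clip the gradients."""
    if isinstance(net, nn.Module):  # The model is an nn.Module
        # Collect the parameters that require gradients
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        # Get the parameter list of the from-scratch model
        params = net.params
    # Sum the squared gradients over all parameters and take the square root (equivalent to flattening every gradient, concatenating them into one vector, and taking its L2 norm)
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    # If the gradient norm exceeds the threshold theta, multiply every gradient by theta / norm
    if norm > theta:
        for param in params:
            # Rescale in place so the overall gradient norm equals theta
            param.grad[:] *= theta / norm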


# @save
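def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train the network for one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, number of tokens
    for X, Y in train_iter:
        # Initialize state
        if state is None or use_random_iter:
            # Initialize state on the first iteration or when using random sampling
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # For nn.GRU, state is a tensor
                # Detach the hidden state from the computation graph: its values are unchanged, but backward() no longer reaches the earlier graph
                state.detach_()
                # After a new minibatch is read, detaching avoids unnecessary gradient computation, which saves time and memory
            else:
                # For nn.LSTM and for our from-scratch model, state is a tuple of tensors, so detach each element
                for s in state:
                    s.detach_()
        # Transpose Y from (batch size, num steps) to (num steps, batch size), then flatten to (num steps * batch size,)
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        # Forward pass with the input sequence and hidden state; returns predictions and the updated hidden state
        y_hat, state = net(X, state)
        # y_hat has shape (num steps * batch size, vocab size)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            # Clear the gradients held by the optimizer
            updater.zero_grad()
            l.backward()
            # Clip the gradients before the parameter update
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # Run the custom update function; batch_size=1 because the loss is already a mean
            updater(batch_size=1)
        # Accumulate the total loss and the number of tokens
        metric.add(l * y.numel(), y.numel())
    # Return the perplexity (exp of the average loss) and the speed (tokens per second)
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()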


# @save
def train_ch8(net, train_iter, vocab, lr, num_epochs, device, use_random_iter=False):
    """Train a model (defined in Chapter 8)."""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialize the updater
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        # Call d2l.sgd with the given batch size to update the network parameters
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    # Prediction helper: generate 50 new characters after the given prefix
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter)
        # Every 10 epochs
        if (epoch + 1) % 10 == 0:
            # Print a generated sequence that starts with 'time traveller'
            print(predict('time traveller'))
            # Add the current perplexity to the animated plot
            animator.add(epoch + 1, [ppl])
    # Print the final perplexity and the speed in tokens per second
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))


batch_size, num_steps = 32, 35
num_hiddens = 512
num_epochs, lr = 500, 1
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu(), use_random_iter=True)
d2l.plt.show()
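The clipping rule in grad_clipping can be checked on plain numbers. The following is a minimal sketch in standard-library Python, not the d2l implementation: the helper name clip_gradients and the toy gradient values are assumptions made for illustration. It treats all gradients as one long vector, measures its L2 norm, and rescales by theta / norm only when the norm exceeds theta.

```python
import math


def clip_gradients(grads, theta):
    """Rescale a list of gradient vectors so their joint L2 norm is at most theta.

    Hypothetical helper for illustration; returns (new gradients, norm before clipping).
    """
    # Treat every gradient as part of one long vector and take its L2 norm.
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > theta:
        scale = theta / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm


# Toy gradients for two parameters; their joint norm is sqrt(3^2 + 4^2) = 5.
toy_grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_gradients(toy_grads, theta=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped, which is why the training loop can call grad_clipping(net, 1) on every minibatch without changing updates that are already small.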


Concise implementation of recurrent neural networks

Loading the dataset

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

Defining the model

num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)

Use a tensor to initialize the hidden state. Its shape is (number of hidden layers, batch size, number of hidden units).

state = torch.zeros((1, batch_size, num_hiddens))
print(state.shape)


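Given a hidden state and an input, we can compute the output with the updated hidden state. Note that the "output" of rnn_layer (i.e. Y) does not involve the output-layer computation: it is the hidden state at every time step, and these hidden states can serve as the input to a subsequent output layer.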

X = torch.rand(size=(num_steps, batch_size, len(vocab)))
Y, state_new = rnn_layer(X, state)
print(Y.shape, state_new.shape)
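The two shapes printed above follow the time-major convention that nn.RNN uses by default (batch_first=False). As a rough sketch, the rule can be written as a small shape calculator; rnn_output_shapes is a hypothetical helper introduced here for illustration and is not part of PyTorch or d2l.

```python
def rnn_output_shapes(num_steps, batch_size, num_hiddens,
                      num_layers=1, bidirectional=False):
    """Return (shape of Y, shape of the final state) for a time-major recurrent layer.

    Hypothetical helper: it only encodes the shape rule, it runs no network.
    """
    num_directions = 2 if bidirectional else 1
    # Y holds the last layer's hidden state at every time step.
    y_shape = (num_steps, batch_size, num_directions * num_hiddens)
    # The final state holds one hidden state per layer and direction.
    state_shape = (num_layers * num_directions, batch_size, num_hiddens)
    return y_shape, state_shape


# Same sizes as above: num_steps=35, batch_size=32, num_hiddens=256.
print(rnn_output_shapes(35, 32, 256))
# ((35, 32, 256), (1, 32, 256))
```

The last dimension of Y doubles for a bidirectional layer, which is why an output layer placed on top of Y needs num_hiddens * 2 input features in that case.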

Defining the RNNModel class

#@save
class RNNModel(nn.Module):
    """Recurrent neural network model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional (introduced later), num_directions should be 2, otherwise 1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            # Build the output layer
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # The fully connected layer first reshapes Y to (num steps * batch size, num hidden units);
        # its output has shape (num steps * batch size, vocab size).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.RNN and nn.GRU take a tensor as the hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # nn.LSTM takes a tuple as the hidden state
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))
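Because forward flattens the time-major tensor Y into rows, row t * batch_size + b of its output belongs to time step t of sequence b. The training loop therefore has to put the labels in the same order, which is what Y.T.reshape(-1) does there. The sketch below reproduces that ordering with plain Python lists; the label strings and the transpose and flatten helpers are made up for illustration.

```python
def transpose(rows):
    """Swap the two axes of a list of lists."""
    return [list(col) for col in zip(*rows)]


def flatten(rows):
    """Concatenate the rows of a list of lists in order."""
    return [x for row in rows for x in row]


# Toy labels with batch_size=2 and num_steps=3; "b1t2" means sequence 1, time step 2.
labels_batch = [["b0t0", "b0t1", "b0t2"],
                ["b1t0", "b1t1", "b1t2"]]
num_steps, batch_size = 3, 2

# Row order of the flattened time-major output: all sequences at step 0, then step 1, and so on.
output_rows = [f"b{b}t{t}" for t in range(num_steps) for b in range(batch_size)]

# Plain-Python counterpart of labels.T.reshape(-1).
labels = flatten(transpose(labels_batch))
print(labels == output_rows)           # transposed labels line up with the output rows
print(flatten(labels_batch) == output_rows)  # flattening without the transpose does not
```

Skipping the transpose would still produce a tensor of the right length, so the mismatch would not raise an error; the loss would simply compare predictions with the wrong targets.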

Training and prediction

Predict with a model that has random weights.

device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
d2l.predict_ch8('time traveller', 10, net, vocab, device)

Full code for this part

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


# @save
class RNNModel(nn.Module):
    """Recurrent neural network model."""

    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        # The from-scratch RNN included the output layer; here rnn_layer contains only the hidden layer, so the output layer is built separately
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional, num_directions should be 2, otherwise 1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            # The linear layer maps the hidden state size to the vocabulary size
            # Build the output layer
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # Reshape to 2-D: the fully connected layer first reshapes Y to (num steps * batch size, num hidden units);
        # its output has shape (num steps * batch size, vocab size).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.RNN and nn.GRU take a tensor as the hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # nn.LSTM takes a tuple as the hidden state
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)
device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
print(d2l.predict_ch8('time traveller', 10, net, vocab, device))

Training the model with the high-level API

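Clearly, an untrained model like this cannot produce good results. Next, call train_ch8 with the hyperparameters defined below to train the model using the high-level API.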

num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
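train_ch8 reports perplexity, which the training loop computes as the exponential of the average cross-entropy loss per token. A minimal standard-library sketch of that relationship follows; the perplexity helper and the toy probabilities are assumptions for illustration, and the vocabulary size of 28 is taken from the shape comments in the from-scratch code.

```python
import math


def perplexity(probs_of_targets):
    """Exponential of the mean negative log-likelihood of the correct tokens.

    Hypothetical helper: probs_of_targets lists the probability the model
    assigned to the true token at each position.
    """
    nll = [-math.log(p) for p in probs_of_targets]
    return math.exp(sum(nll) / len(nll))


vocab_size = 28
# A model that guesses uniformly over the vocabulary.
uniform = perplexity([1 / vocab_size] * 10)
# A model that always puts probability 1 on the correct token.
perfect = perplexity([1.0] * 10)
print(uniform, perfect)
```

A uniform guesser scores a perplexity equal to the vocabulary size and a perfect model scores 1, so the value printed during training can be read as the effective number of tokens the model is still choosing between.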

Full code

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


# @save
class RNNModel(nn.Module):
    """Recurrent neural network model."""

    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        # The from-scratch RNN included the output layer; here rnn_layer contains only the hidden layer, so the output layer is built separately
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional, num_directions should be 2, otherwise 1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            # The linear layer maps the hidden state size to the vocabulary size
            # Build the output layer
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # Reshape to 2-D: the fully connected layer first reshapes Y to (num steps * batch size, num hidden units);
        # its output has shape (num steps * batch size, vocab size).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.RNN and nn.GRU take a tensor as the hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # nn.LSTM takes a tuple as the hidden state
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)
device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
d2l.plt.show()