Lecture 8 Deep Learning for NLP: Recurrent Networks

news2025/7/9 8:35:17

Recurrent Networks

Allow representation of arbitrarily sized inputs
Core idea: process the input sequence one element at a time by applying a recurrence formula
Use a state vector to represent the context that has been processed so far
RNN Neuron:
RNN States:

Activation:
RNN Unrolled:
- The same parameters ($W_s$, $W_x$, $b$ and $W_y$) are used across all time steps
Training RNN:
- An unrolled RNN is a very deep neural network, but its parameters are shared across all time steps
- To train an RNN, create the unrolled computation graph for a given input sequence and use the backpropagation algorithm to compute gradients as usual
- This procedure is called backpropagation through time
  
  E.g. of unrolled equation:
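
As an illustration, assuming the recurrence $s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$ used in these notes and an initial state $s_0$ (the name $s_0$ is an assumption), unrolling three time steps gives:

$$
s_3 = \tanh\big(W_s \tanh\big(W_s \tanh(W_s s_0 + W_x x_1 + b) + W_x x_2 + b\big) + W_x x_3 + b\big)
$$

The same $W_s$, $W_x$ and $b$ appear at every nesting level, so the gradient for each parameter collects a contribution from every time step.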


$x_i$ is the current word (e.g. eats) mapped to an embedding
$s_{i-1}$ contains information about the previous words (e.g. a and cow)
$y_i$ is the next word (e.g. grass)
Training:
- Vocabulary: [a, cow, eats, grass]
- Training example: a cow eats grass
- Training process:
  
  $s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
  
  $y_i = \text{softmax}(W_y s_i)$
- Losses:
  - $L_1 = -\log(0.30)$
  - $L_2 = -\log(0.50)$
  - $L_3 = -\log(0.20)$
  - Total loss: $L_{total} = L_1 + L_2 + L_3$
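
The training process above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's implementation: the embedding matrix `E`, the dimensions, the zero initial state, and the random weights are all assumptions. The last line recomputes the total loss from the three probabilities quoted in the notes, assuming the natural logarithm.

```python
import numpy as np

vocab = ["a", "cow", "eats", "grass"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(token_ids, params):
    """Apply the recurrence to each input word; return one distribution per step."""
    E, W_s, W_x, b, W_y = params
    s = np.zeros(W_s.shape[0])  # initial state (assumed to be all zeros)
    outputs = []
    for t in token_ids:
        x = E[t]                              # embedding of the current word
        s = np.tanh(W_s @ s + W_x @ x + b)    # s_i = tanh(W_s s_{i-1} + W_x x_i + b)
        outputs.append(softmax(W_y @ s))      # y_i = softmax(W_y s_i)
    return outputs

def total_loss(correct_word_probs):
    """Sum of negative log-probabilities assigned to the correct next words."""
    return float(-np.sum(np.log(correct_word_probs)))

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 3, 5  # hypothetical sizes
params = (
    rng.normal(size=(len(vocab), embed_dim)),          # E
    rng.normal(size=(hidden_dim, hidden_dim)) * 0.1,   # W_s
    rng.normal(size=(hidden_dim, embed_dim)) * 0.1,    # W_x
    np.zeros(hidden_dim),                              # b
    rng.normal(size=(len(vocab), hidden_dim)) * 0.1,   # W_y
)

ids = [word_to_id[w] for w in ["a", "cow", "eats", "grass"]]
inputs, targets = ids[:-1], ids[1:]  # predict each next word from the words so far
probs = rnn_forward(inputs, params)
model_loss = total_loss([p[t] for p, t in zip(probs, targets)])

# Total loss for the probabilities quoted in the notes.
notes_loss = total_loss([0.30, 0.50, 0.20])
```

With untrained random weights `model_loss` has no particular meaning; training would adjust the parameters to reduce it.
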
Generation:
Problems of RNN:
- Error propagation: unable to recover from errors in intermediate steps
- Low diversity in generated language
- Tends to generate bland or generic language
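
To make the error-propagation problem concrete, here is a hedged sketch of one common way to generate from an RNN language model: each predicted word is fed back as the next input, so a poor choice at one step affects every later step. The function `generate`, the random weights, and the optional sampling mode are assumptions for illustration; the notes do not specify a decoding procedure.

```python
import numpy as np

vocab = ["a", "cow", "eats", "grass"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(first_word, num_words, params, sample=False, rng=None):
    """Generate num_words words after first_word, feeding each prediction back in."""
    E, W_s, W_x, b, W_y = params
    s = np.zeros(W_s.shape[0])
    word_id = vocab.index(first_word)
    words = [first_word]
    for _ in range(num_words):
        s = np.tanh(W_s @ s + W_x @ E[word_id] + b)
        p = softmax(W_y @ s)
        if sample:
            # Sampling from the distribution is one way to increase diversity.
            word_id = int(rng.choice(len(vocab), p=p))
        else:
            # Greedy decoding: always take the most probable word.
            word_id = int(np.argmax(p))
        words.append(vocab[word_id])  # this word becomes the next input
    return words

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 3, 5  # hypothetical sizes
params = (
    rng.normal(size=(len(vocab), embed_dim)),
    rng.normal(size=(hidden_dim, hidden_dim)) * 0.1,
    rng.normal(size=(hidden_dim, embed_dim)) * 0.1,
    np.zeros(hidden_dim),
    rng.normal(size=(len(vocab), hidden_dim)) * 0.1,
)

greedy = generate("a", 3, params)
sampled = generate("a", 3, params, sample=True, rng=rng)
```

Greedy decoding is deterministic and tends toward the most frequent continuations, which is one reason generated text can come out generic.
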

In principle an RNN can model unbounded context, but in practice it cannot capture long-range dependencies because of vanishing gradients
Vanishing gradient: gradients from later steps diminish quickly as they are propagated back through time, so earlier inputs receive little update
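
The effect can be observed numerically. The sketch below assumes the tanh recurrence from these notes, for which the Jacobian of one step is $\text{diag}(1 - s_i^2)\,W_s$, and multiplies these Jacobians across 50 steps. The rescaling of `W_s` to a largest singular value of 0.9 and the random stand-in for $W_x x_i + b$ are assumptions chosen to make the decay easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_steps = 16, 50

W_s = rng.normal(size=(hidden_dim, hidden_dim))
W_s *= 0.9 / np.linalg.norm(W_s, 2)  # rescale so the largest singular value is 0.9

s = np.zeros(hidden_dim)
grad = np.eye(hidden_dim)  # d s_0 / d s_0
norms = []
for _ in range(num_steps):
    input_term = rng.normal(size=hidden_dim) * 0.1  # stands in for W_x x_i + b
    s = np.tanh(W_s @ s + input_term)
    jac = (1.0 - s**2)[:, None] * W_s  # d s_i / d s_{i-1} = diag(1 - s_i^2) W_s
    grad = jac @ grad                  # d s_i / d s_0 by the chain rule
    norms.append(np.linalg.norm(grad, 2))
```

Each entry of `norms` measures how strongly the state at step $i$ still depends on the initial state. Because every Jacobian here has spectral norm at most 0.9, the values shrink at least geometrically, which is the sense in which early inputs receive little gradient.
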
LSTM was introduced to address vanishing gradients
Core idea: have memory cells that preserve gradients across time. Access to the memory cells is controlled by gates.
Gates: for each input, the gates decide:
- How much of the new input should be written to the memory cell
- How much of the current memory cell content should be forgotten
Comparison between simple RNN and LSTM:

A gate $g$ is a vector whose elements each take a value between 0 and 1. A sigmoid function is used to produce $g$.
$g$ is multiplied component-wise with a vector $v$ to determine how much of the information in $v$ to keep
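
A small sketch of the gating operation, with made-up numbers: `gate_scores` stands for whatever pre-activation values the network computes, and both arrays are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gate_scores = np.array([-4.0, 0.0, 4.0])  # hypothetical pre-activation values
v = np.array([2.0, 2.0, 2.0])             # hypothetical vector to be gated

g = sigmoid(gate_scores)  # each element lies strictly between 0 and 1
kept = g * v              # component-wise product, roughly [0.04, 1.0, 1.96]
```

A gate value near 0 blocks the corresponding component of $v$, and a value near 1 lets it through almost unchanged.
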


Forget gate controls how much information to forget from the memory cell $C_{t-1}$
E.g. given "The cats that the boy", predict the next word "likes"
- The memory cell was storing the noun information "cats"
- The cell should now forget "cats" and store "boy" to correctly predict the singular verb "likes"


Input gate controls how much new information to put into the memory cell
$\widetilde{C}$ is the new distilled information to be added


Output gate controls how much of the memory cell content is distilled to create the next state
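
The three gates can be put together in a single step function. This is a minimal sketch assuming the widely used LSTM formulation in which every gate is computed from the concatenation $[h_{t-1}, x_t]$; the parameter names (`W_f`, `W_i`, `W_C`, `W_o` and the matching biases), the sizes, and the random initialisation are assumptions, not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step: returns the new state h_t and the new memory cell C_t."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])      # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])  # new distilled information
    C_t = f_t * C_prev + i_t * C_tilde          # forget old content, write new content
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                    # next state distilled from the cell
    return h_t, C_t

def init_params(input_dim, hidden_dim, rng):
    """Random weights and zero biases for the four transformations (illustrative)."""
    p = {}
    for name in ("f", "i", "C", "o"):
        p["W_" + name] = rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
        p["b_" + name] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # hypothetical sizes
params = init_params(input_dim, hidden_dim, rng)

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # five hypothetical input embeddings
    h, C = lstm_step(x_t, h, C, params)
```

The cell update `C_t = f_t * C_prev + i_t * C_tilde` is additive rather than a repeated matrix multiplication, which is what lets gradients survive across many steps when the forget gate stays close to 1.
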

Shakespeare Generator:
- Training data: all works of Shakespeare
- Model: character-level RNN, hidden dimension = 512
Wikipedia Generator:
- Training data: 100MB of raw Wikipedia data
Code Generator
Text Classification
- RNNs can be used in a variety of NLP tasks. They are particularly suited to tasks where the order of words matters, e.g. sentiment analysis
Sequence Labeling: e.g. POS tagging
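
The two task types differ mainly in which states feed the output layer. The sketch below assumes a common setup: classification predicts one label from the final state, while sequence labeling predicts one tag per state. The matrices `W_cls` and `W_tag`, the class and tag counts, and the random embeddings are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_states(embeddings, W_s, W_x, b):
    """Run the recurrence over a sequence and return the state at every step."""
    s = np.zeros(W_s.shape[0])
    states = []
    for x in embeddings:
        s = np.tanh(W_s @ s + W_x @ x + b)
        states.append(s)
    return np.stack(states)

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 3, 5    # hypothetical sizes
num_classes, num_tags = 2, 4    # e.g. positive/negative sentiment; four POS tags

W_s = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_x = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
b = np.zeros(hidden_dim)
W_cls = rng.normal(size=(num_classes, hidden_dim)) * 0.1  # classification head
W_tag = rng.normal(size=(num_tags, hidden_dim)) * 0.1     # tagging head

sentence = rng.normal(size=(6, embed_dim))  # six hypothetical word embeddings
states = rnn_states(sentence, W_s, W_x, b)

# Text classification: one prediction for the whole sequence, from the last state.
class_probs = softmax(W_cls @ states[-1])

# Sequence labeling: one prediction per word, from the state at that position.
tag_probs = np.stack([softmax(W_tag @ s) for s in states])
```
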

Peephole connections: allow the gates to look at the cell state

$f_t = \sigma(W_f[C_{t-1}, h_{t-1}, x_t] + b_f)$
Gated recurrent unit (GRU): a simplified variant with only 2 gates and no memory cell
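
A minimal sketch of one GRU step, assuming the commonly used formulation with an update gate $z_t$ and a reset gate $r_t$. The gate names, the parameter names, and the sizes are assumptions; the notes only state that the GRU has two gates and no separate memory cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: the state h carries the memory, with no separate cell."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["W_z"] @ hx + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ hx + p["b_r"])  # reset gate
    # Candidate state computed from the reset-scaled previous state and the input.
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + p["b_h"])
    # Interpolate between keeping the old state and taking the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # hypothetical sizes
params = {}
for name in ("z", "r", "h"):
    params["W_" + name] = rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
    params["b_" + name] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # five hypothetical input embeddings
    h = gru_step(x_t, h, params)
```
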
Multi-layer LSTM
Bidirectional LSTM
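
Both variants change how recurrent layers are wired rather than what happens inside a layer. The sketch below illustrates the wiring only and, to stay short, uses a plain tanh recurrent layer as a stand-in for an LSTM layer; that substitution, the helper names, and all sizes are assumptions.

```python
import numpy as np

def run_layer(inputs, W_s, W_x, b):
    """Plain tanh recurrent layer (stand-in for an LSTM layer); returns all states."""
    s = np.zeros(W_s.shape[0])
    states = []
    for x in inputs:
        s = np.tanh(W_s @ s + W_x @ x + b)
        states.append(s)
    return np.stack(states)

def make_layer(input_dim, hidden_dim, rng):
    """Random parameters for one layer (illustrative initialisation)."""
    return (
        rng.normal(size=(hidden_dim, hidden_dim)) * 0.1,
        rng.normal(size=(hidden_dim, input_dim)) * 0.1,
        np.zeros(hidden_dim),
    )

rng = np.random.default_rng(0)
seq_len, embed_dim, hidden_dim = 6, 3, 4  # hypothetical sizes
inputs = rng.normal(size=(seq_len, embed_dim))

# Multi-layer: the states of layer 1 are the inputs of layer 2.
layer1 = make_layer(embed_dim, hidden_dim, rng)
layer2 = make_layer(hidden_dim, hidden_dim, rng)
stacked = run_layer(run_layer(inputs, *layer1), *layer2)

# Bidirectional: one layer reads left to right, another right to left;
# the backward states are flipped back so both align by position.
fwd = make_layer(embed_dim, hidden_dim, rng)
bwd = make_layer(embed_dim, hidden_dim, rng)
forward_states = run_layer(inputs, *fwd)
backward_states = run_layer(inputs[::-1], *bwd)[::-1]
bidirectional = np.concatenate([forward_states, backward_states], axis=1)
```

In the bidirectional case each position gets a representation that depends on both the words before it and the words after it, at the cost of needing the full sequence up front.
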