Introduction

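The previous section described the vanishing-gradient problem that arises in back-propagation through recurrent neural networks, and used it to motivate the long short-term memory network $(\text{Long Short-Term Memory, LSTM})$. This section examines, from the perspective of back-propagation, why $\text{LSTM}$ can mitigate vanishing gradients.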

Review and supplement: back-propagation through time

Recall from the previous section the back-propagation of $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}}\end{aligned}$ in an $\text{RNN}$:
Figure: example of the RNN back-propagation process
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(1)}$
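If we only want the gradient of the loss $\mathcal L^{(\mathcal T)}$ at time $\mathcal T$ with respect to the weight component $\mathcal W_{x^{(1)} \Rightarrow h^{(1)}}$ (the red path), the formula above suffices. The layer is recurrent, however, so what we actually need is the gradient of $\mathcal L^{(\mathcal T)}$ with respect to the whole shared weight $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$.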

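This means that at every time step $t\ (t=1,2,\cdots,\mathcal T)$, the gradient of that step's $\mathcal W_{x^{(t)}\Rightarrow h^{(t)}}$ is accumulated into the gradient of $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$. The gradient of $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$, $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}\end{aligned}$, is therefore written as follows:
Substitute the formula above with the time index generalized to $t$.
The braces in the last line of the formula below do not denote a matrix; they are only a convenient way to write a sum of $\mathcal T$ terms.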
$\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}} & = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow h^{(\mathcal T)}}} + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-1)} \Rightarrow h^{(\mathcal T-1)}}} + \cdots + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} \\ & = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow h^{(t)}}} \\ & = \sum_{t=1}^{\mathcal T} \left[(\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=t}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=t+1}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(t)}\right] \\ & = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \begin{Bmatrix} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(\mathcal T)})\right] \cdot x^{(\mathcal T)} \\ +\prod_{k=\mathcal T-1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow h^{(\mathcal T)}} \cdot x^{(\mathcal T - 1)} \\ \vdots \\ +\prod_{k=1}^{\mathcal T} \text{Diag}\left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}} \cdot x^{(1)} \end{Bmatrix} \end{aligned}$
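Clearly, the braces in the formula above contain a sum of $\mathcal T$ terms. Inspecting them shows that the further down a term sits in the sum (that is, the earlier its time step $t$), the more severely its gradient vanishes, since it accumulates more factors of $\text{Diag}\left[1 - \text{Tanh}^2(\cdot)\right]$ and $\mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}$. In other words, the gradient $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}\end{aligned}$ is dominated by the first few steps of back-propagation, namely the time steps closest to $\mathcal T$. This back-propagation algorithm, whose computational cost is $\mathcal O(\mathcal T)$, is called back-propagation through time $(\text{Back-Propagation Through Time, BPTT})$.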
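The decay of the summands can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not part of the derivation above: a scalar RNN (so $\text{Diag}[\cdot]$ and the weight matrices reduce to numbers), a constant input sequence, and hypothetical values `w_x = 0.5`, `w_h = 0.9`. The function name `bptt_terms` is likewise only illustrative. It evaluates each term inside the braces, leaving out the shared prefactor.

```python
import math

def bptt_terms(xs, w_x=0.5, w_h=0.9, h0=0.0):
    """Summands inside the braces for a scalar tanh RNN.

    The entry for time step t is
        prod_{k=t}^{T} (1 - tanh^2(z_k)) * w_h^(T - t) * x_t,
    i.e. the contribution of step t without the shared prefactor.
    """
    # Forward pass: keep the pre-activations z_k.
    zs, h = [], h0
    for x in xs:
        z = w_x * x + w_h * h
        zs.append(z)
        h = math.tanh(z)

    T = len(xs)
    terms = []
    for t in range(T):  # t is the 0-based index of time step t + 1
        diag = 1.0
        for k in range(t, T):
            diag *= 1.0 - math.tanh(zs[k]) ** 2
        terms.append(diag * w_h ** (T - 1 - t) * xs[t])
    return terms

terms = bptt_terms([1.0] * 20)
for step in (20, 15, 10, 5, 1):
    print(f"time step {step:2d}: |term| = {abs(terms[step - 1]):.3e}")
```

With these values the magnitudes shrink monotonically as the time step moves away from $\mathcal T$, which matches the observation that the sum is dominated by the most recent steps.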

Back-propagation in $\text{LSTM}$

Setup

The forward computation of $\text{LSTM}$ at time $t$ is:
Here $y^{(t)}$ denotes the final output at the current time step, which then feeds into the loss $\mathcal L^{(t)}$ of that step.
$\begin{aligned} & f^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)}\Rightarrow f^{(t)}} \cdot h^{(t-1)} + b_f\right] \\ & i^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \cdot h^{(t-1)} + b_i\right] \\ & {\mathcal O}^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \cdot h^{(t-1)} + b_{\mathcal O}\right] \\ & \widetilde{\mathcal C}^{(t)} = \text{Tanh} \left[\mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot h^{(t-1)} + b_{\widetilde{\mathcal C}}\right] \\ & \mathcal C^{(t)} = \mathcal C^{(t-1)} * f^{(t)} + \widetilde{\mathcal C}^{(t)} * i^{(t)} \\ & h^{(t)} = {\mathcal O}^{(t)} * \text{Tanh}(\mathcal C^{(t)}) \\ & y^{(t)} = \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \cdot h^{(t)} + b_y \quad \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal Y} \end{aligned}$
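As in $\text{BPTT}$ above, the weight parameters in these formulas, such as $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}},\mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}$ and so on, are the weights applied at time $t$ to the input $x^{(t)}$ or to the previous hidden state $h^{(t-1)}$ in each gate. During back-propagation, the gradient from every time step is accumulated into the corresponding shared weight parameter. We denote the correspondence as: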
$\begin{aligned} & \mathcal W_{\mathcal X \Rightarrow}: \begin{cases} \mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal F} \quad \mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal I} \\ \mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal O} \quad \mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal C}} \end{cases} \\ & \mathcal W_{\mathcal H \Rightarrow}: \begin{cases} \mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal F} \quad \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal I} \\ \mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal O} \quad \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal C}} \end{cases} \end{aligned}$
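The forward equations and the weight sharing across time steps can be sketched as a single step function that is called repeatedly with the same parameters. This is only an illustrative scalar version: the dictionary keys such as `w_xf` and `w_hf` are hypothetical names standing in for $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$, $\mathcal W_{\mathcal H \Rightarrow \mathcal F}$ and so on, and the numeric values are arbitrary.

```python
import math

# Hypothetical parameter names: w_x* act on x^(t), w_h* act on h^(t-1).
PARAM_NAMES = (
    "w_xf", "w_hf", "b_f",   # forget gate
    "w_xi", "w_hi", "b_i",   # input gate
    "w_xo", "w_ho", "b_o",   # output gate
    "w_xc", "w_hc", "b_c",   # candidate state
    "w_hy", "b_y",           # output layer
)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One forward step of a scalar LSTM cell with shared parameters p."""
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["b_o"])
    c_tilde = math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])
    c = c_prev * f + c_tilde * i      # new cell state
    h = o * math.tanh(c)              # new hidden state
    y = p["w_hy"] * h + p["b_y"]      # output fed to the loss
    return h, c, y

params = {name: 0.1 for name in PARAM_NAMES}  # arbitrary illustrative values
h, c = 0.0, 0.0
for t, x in enumerate([1.0, -0.5, 0.25], start=1):
    h, c, y = lstm_step(x, h, c, params)      # same params at every step
    print(f"t = {t}: h = {h:.4f}, C = {c:.4f}, y = {y:.4f}")
```

Because the same `params` object is used at every step, a gradient with respect to any one of its entries is the sum of that entry's per-step gradients.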

Example: computing the gradient $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}\end{aligned}$

Suppose the sequence length is $\mathcal T$ and the loss produced at time $\mathcal T$ is $\mathcal L^{(\mathcal T)}$. We want the gradient of $\mathcal L^{(\mathcal T)}$ with respect to the weight matrix $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$, $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}\end{aligned}$:
As with the recurrent neural network, it is the sum of the gradients from all time steps.
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}} = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow f^{(t)}}}$

Back-propagation: the gradient $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}\end{aligned}$ at time $\mathcal T$

We first look at the gradient at the last time step, $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}\end{aligned}$:

Its gradient path can be written as:
$\begin{aligned} \begin{cases} & \widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \underbrace{\mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f}_{\text{no gradient contribution}} \\ & f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\ & \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \underbrace{\widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)}}_{\text{no gradient contribution}} \\ & m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\ & h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\ & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \end{cases} \end{aligned}$
The corresponding propagation path is shown in the figure (red arrows):

Hence the gradient $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}\end{aligned}$ can be written as:
$\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}} & = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)}\Rightarrow f^{(\mathcal T)}}} \end{aligned}$
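As a sketch, each factor of this chain can be read off from the path equations. The expansion below is written for the scalar (element-wise) case, so transposes and $\text{Diag}(\cdot)$ wrappers are omitted, and $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}}$ is left unexpanded because no specific loss function is fixed here:
$\begin{aligned} & \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \quad \frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} = \mathcal O^{(\mathcal T)} \quad \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} = 1 - \text{Tanh}^2(\mathcal C^{(\mathcal T)}) \\ & \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} = \mathcal C^{(\mathcal T - 1)} \quad \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} = f^{(\mathcal T)} \left(1 - f^{(\mathcal T)}\right) \quad \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)}\Rightarrow f^{(\mathcal T)}}} = x^{(\mathcal T)} \\ & \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}} = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot \mathcal O^{(\mathcal T)} \cdot \left[1 - \text{Tanh}^2(\mathcal C^{(\mathcal T)})\right] \cdot \mathcal C^{(\mathcal T - 1)} \cdot f^{(\mathcal T)} \left(1 - f^{(\mathcal T)}\right) \cdot x^{(\mathcal T)} \end{aligned}$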

Back-propagation: the gradient $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T -1)}}}\end{aligned}$ at time $\mathcal T - 1$

We have now obtained the contribution to the gradient of $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$ from time $\mathcal T$ itself, the step closest to the loss. What about time $\mathcal T-1$, and how does it differ from time $\mathcal T$? We now express the gradient of the loss $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ at time $\mathcal T-1$, $\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}} \end{aligned}$:
The gradient with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ must pass through time $\mathcal T$, and its back-propagation consists of several kinds of paths:

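Case 1: through the output gate, directly from $h^{(\mathcal T)}$ back to $h^{(\mathcal T - 1)}$, and then from $h^{(\mathcal T - 1)}$ back to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. Its gradient path can be written as:
The part replaced by the ellipsis is the same as at time $\mathcal T$, with every $\mathcal T$ changed to $\mathcal T-1$; the same applies to the ellipses below.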
$\begin{cases} & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\ & h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\ & {\mathcal O}^{(\mathcal T)} = \sigma \left[\underbrace{\mathcal W_{x^{(\mathcal T)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + b_{\mathcal O}}_{\text{independent of } h^{(\mathcal T-1)}} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)}\right] \\ & h^{(\mathcal T - 1)} = {\mathcal O}^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\ & \quad\vdots \end{cases}$
Hence the gradient along this path is:
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]$
Case 2: from the cell state $\mathcal C^{(\mathcal T)}$ back to $\mathcal C^{(\mathcal T -1)}$, then from $\mathcal C^{(\mathcal T - 1)}$ back to $f^{(\mathcal T-1)}$, and on to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. Its gradient path can be written as:
$\begin{aligned} \begin{cases} & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\ & h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\ & m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\ & \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\ & \mathcal C^{(\mathcal T-1)} = \mathcal C^{(\mathcal T -2)} * f^{(\mathcal T-1)} + \widetilde{\mathcal C}^{(\mathcal T-1)} * i^{(\mathcal T-1)} \\ & f^{(\mathcal T-1)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T-1)}) \\ & \widetilde{f}^{(\mathcal T-1)} = \mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T-1)}} \cdot x^{(\mathcal T-1)} + \underbrace{\mathcal W_{h^{(\mathcal T -2)} \Rightarrow f^{(\mathcal T-1)}}\cdot h^{(\mathcal T -2)} + b_f}_{\text{no gradient contribution}} \end{cases} \end{aligned}$
Hence the gradient along this path is:
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]$
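Case 3: starting from the cell state $\mathcal C^{(\mathcal T)}$, back through the forget gate to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as: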
$\begin{aligned} \begin{cases} & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\ & h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\ & m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\ & \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\ & f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\ & \widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f \\ & h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\ & \quad\vdots \end{cases} \end{aligned}$
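Hence the gradient along this path is:
The gradient path from $h^{(\mathcal T - 1)}$ to $\mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T - 1)}}$ is fixed; see case $1$ (its last $5$ factors). It is abbreviated with an ellipsis here and below.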
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots\right]$
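Case 4: starting from the cell state $\mathcal C^{(\mathcal T)}$, back through the input gate to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as:
New symbol: $\widetilde{i}^{(\mathcal T)}$ denotes the linear (pre-activation) computation of the input gate.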
$\begin{aligned} \begin{cases} & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\ & h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\ & m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\ & \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\ & i^{(\mathcal T)} = \text{Sigmoid}(\widetilde{i}^{(\mathcal T)}) \\ & \widetilde{i}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)}\Rightarrow i^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow i^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_i \\ & h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\ & \quad\vdots \end{cases} \end{aligned}$
The gradient along this path is:
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]$
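Case 5: starting from the cell state $\mathcal C^{(\mathcal T)}$, back through the candidate state $\widetilde{\mathcal C}^{(\mathcal T)}$ to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. Its propagation path can be written as: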
$\begin{aligned} \begin{cases} & y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\ & h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\ & m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\ & \mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\ & \widetilde{\mathcal C}^{(\mathcal T)} = \text{Tanh} \left[\mathcal W_{x^{(\mathcal T)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_{\widetilde{\mathcal C}}\right] \\ & h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\ & \quad\vdots \end{cases} \end{aligned}$
The gradient along this path is:
$\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]$

We have now found every gradient path for $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}\end{aligned}$. Summing these gradients gives:
As before, the braces only denote a sum of terms, not a matrix. Four of the paths pass through the cell state $\mathcal C^{(\mathcal T)}$, so their common factor is pulled out.
$\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}} & = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \begin{Bmatrix} \frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \underbrace{\frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}}_{\cdots} \\ \quad \\ +\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \begin{bmatrix} \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}} \\ +\frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots \\ +\frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots \\ +\frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots \end{bmatrix} \end{Bmatrix} \end{aligned}$
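One way to sanity-check that these five paths account for the whole gradient is to compare their sum with a finite-difference estimate. The sketch below does this for a scalar LSTM run over the two steps $\mathcal T-1$ and $\mathcal T$. Everything numeric in it is an assumption made for illustration: the parameter values in `P`, the inputs, the initial states, and the helper names `step`, `forward` and `five_paths`. The common prefactor $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}}$ is left out because no loss function is fixed here, so the check is on $\frac{\partial y^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T-1)}}}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar parameters, shared across time steps.
P = dict(w_xf=0.3, w_hf=-0.4, b_f=0.1,
         w_xi=0.5, w_hi=0.2, b_i=-0.1,
         w_xo=-0.3, w_ho=0.6, b_o=0.2,
         w_xc=0.7, w_hc=-0.5, b_c=0.0,
         w_hy=1.5, b_y=0.1)
# Assumed x^(T-1), x^(T), h^(T-2), C^(T-2).
X_PREV, X_LAST, H0, C0 = 0.8, -0.6, 0.3, 0.5

def step(x, h_prev, c_prev, w_xf):
    """One scalar LSTM step. w_xf is an argument so that the copy used at
    time T-1 can be perturbed without touching the copy used at time T."""
    f = sigmoid(w_xf * x + P["w_hf"] * h_prev + P["b_f"])
    i = sigmoid(P["w_xi"] * x + P["w_hi"] * h_prev + P["b_i"])
    o = sigmoid(P["w_xo"] * x + P["w_ho"] * h_prev + P["b_o"])
    g = math.tanh(P["w_xc"] * x + P["w_hc"] * h_prev + P["b_c"])  # candidate
    c = c_prev * f + g * i
    m = math.tanh(c)
    return dict(f=f, i=i, o=o, g=g, c=c, m=m, h=o * m)

def forward(w_xf_prev):
    """Run steps T-1 and T; return y^(T) and the intermediates of both steps."""
    s1 = step(X_PREV, H0, C0, w_xf_prev)
    s2 = step(X_LAST, s1["h"], s1["c"], P["w_xf"])
    return P["w_hy"] * s2["h"] + P["b_y"], s1, s2

def five_paths(w_xf_prev):
    """The five bracketed path products, in the order listed in the text."""
    _, s1, s2 = forward(w_xf_prev)
    # dC/df * df/df~ * df~/dW at time T-1
    tail_c = C0 * s1["f"] * (1 - s1["f"]) * X_PREV
    # dh/dm * dm/dC at time T-1, then tail_c (the part written as "...")
    tail_h = s1["o"] * (1 - s1["m"] ** 2) * tail_c
    # dh/dm * dm/dC at time T, shared by the four cell-state paths
    common = s2["o"] * (1 - s2["m"] ** 2)
    return [
        s2["m"] * s2["o"] * (1 - s2["o"]) * P["w_ho"] * tail_h,           # output gate
        common * s2["f"] * tail_c,                                         # cell state
        common * s1["c"] * s2["f"] * (1 - s2["f"]) * P["w_hf"] * tail_h,   # forget gate
        common * s2["g"] * s2["i"] * (1 - s2["i"]) * P["w_hi"] * tail_h,   # input gate
        common * s2["i"] * (1 - s2["g"] ** 2) * P["w_hc"] * tail_h,        # candidate
    ]

w = P["w_xf"]
analytic = P["w_hy"] * sum(five_paths(w))
eps = 1e-5
numeric = (forward(w + eps)[0] - forward(w - eps)[0]) / (2 * eps)
print(f"sum of five paths = {analytic:.8f}, finite difference = {numeric:.8f}")
```

If a path were missing or double-counted, the two printed numbers would disagree.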

Comparing the gradients with respect to $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ at times $\mathcal T - 2$ and $\mathcal T - 1$

Does the gradient at time $\mathcal T - 2$, $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-2)} \Rightarrow f^{(\mathcal T - 2)}}}\end{aligned}$, have the same form as at time $\mathcal T - 1$? It does not. The reason is that the step $\mathcal T \Rightarrow \mathcal T - 1$ only involves paths that start from $h^{(\mathcal T)}$; that is, every path descends from $\begin{aligned}\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}}\end{aligned}$. The step $\mathcal T- 1 \Rightarrow \mathcal T -2$, however, involves not only paths that start from $h^{(\mathcal T - 1)}$ but also paths that start from $\mathcal C^{(\mathcal T - 1)}$:
The paths from $h^{(\mathcal T - 1)}$ are the same as those from $h^{(\mathcal T)}$ and are not repeated here; the paths from $\mathcal C^{(\mathcal T-1)}$ take the following forms. They do overlap in part with the $h^{(\mathcal T - 1)}$ paths, but their gradients must be computed separately, because $\mathcal C^{(\mathcal T-1)}$ and $h^{(\mathcal T -1)}$ are different variables.
$\mathcal C^{(\mathcal T -1)} \Rightarrow \begin{cases} \begin{aligned} & \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial \mathcal C^{(\mathcal T-2)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 2)}}{\partial f^{(\mathcal T - 2)}} \cdot \frac{\partial f^{(\mathcal T - 2)}}{\partial \widetilde{f}^{(\mathcal T - 2)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-2)}}{\partial \mathcal W_{x^{(\mathcal T-2)}\Rightarrow f^{(\mathcal T-2)}}} \\ & \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial h^{(\mathcal T-2)}} \cdots \\ &\frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial i^{(\mathcal T-1)}} \cdot \frac{\partial i^{(\mathcal T-1)}}{\partial \widetilde{i}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots \\ & \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots \end{aligned} \end{cases}$
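In total, $\mathcal T \Rightarrow \mathcal T-2$ contains $4 \times 5 + 4 = 24$ paths.
The $+4$ accounts for the output-gate path: because that path does not pass through the cell state $\mathcal C^{(t)}$, each time $h^{(\mathcal T)}$ or $h^{(\mathcal T - 1)}$ is reached there is exactly one such path on to the corresponding $h^{(\mathcal T - 1)}$ or $h^{(\mathcal T - 2)}$.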
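The path counts can be reproduced with a small recurrence. The sketch below is one way to organize the counting and is not taken from the text: it tracks how many paths lead from $h^{(\mathcal T)}$ to the hidden state and to the cell state of each earlier step. The function name `count_paths` and its argument `steps_back` are illustrative.

```python
def count_paths(steps_back):
    """Number of gradient paths from h^(T) to W_{x=>f} at time T - steps_back.

    paths_h counts paths from h^(T) to h^(t);
    paths_c counts paths from h^(T) to C^(t).
    """
    # At t = T: h^(T) itself, and the single path h^(T) -> m^(T) -> C^(T).
    paths_h, paths_c = 1, 1
    for _ in range(steps_back):
        # h^(t-1) is reached from h^(t) through the output gate (1 way) and
        # from C^(t) through the forget gate, input gate, or candidate (3 ways).
        new_h = paths_h + 3 * paths_c
        # C^(t-1) is reached from h^(t-1) through Tanh, and from C^(t) directly.
        new_c = new_h + paths_c
        paths_h, paths_c = new_h, new_c
    # Every path reaching C^(t) continues in exactly one way to f^(t) and W.
    return paths_c

for n in range(4):
    print(f"T - {n}: {count_paths(n)} paths")
# Counts for T, T-1, T-2, T-3: 1, 5, 24, 115
```

The counts $1, 5, 24$ match the single path at time $\mathcal T$, the five paths at time $\mathcal T-1$, and the $24$ paths at time $\mathcal T-2$, and the sequence grows by a factor of roughly five per step.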

Why $\text{LSTM}$ can mitigate vanishing gradients

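As back-propagation goes deeper, the number of back-propagation paths grows exponentially. Even if the gradient along some paths vanishes, the sheer number of paths can compensate;
For example, the cell-state gradient $\begin{aligned}\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}\end{aligned}$ can be regulated through the weight parameters of the gates: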
$\begin{aligned} \frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}} & = f^{(t)} + \begin{pmatrix} \begin{aligned} & \quad \frac{\partial \mathcal C^{(t)}}{\partial f^{(t)}} \cdot \frac{\partial f^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\ &+ \frac{\partial \mathcal C^{(t)}}{\partial i^{(t)}} \cdot \frac{\partial i^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\ &+ \frac{\partial \mathcal C^{(t)}}{\partial \widetilde{\mathcal C}^{(t)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \end{aligned} \end{pmatrix} \\ & = f^{(t)} + \begin{pmatrix} \mathcal C^{(t-1)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\ +\widetilde{\mathcal C}^{(t)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\ +i^{(t)} \cdot \left\{\left[\text{Tanh}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}}\right\}\cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\}\\ \end{pmatrix} \end{aligned}$
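We can see that every additional step of back-propagation brings $4$ partial-derivative terms into that step's gradient, and three of them are determined jointly by the weight parameters of the current step's forget gate, input gate, and candidate state, each scaled by the previous output gate $\mathcal O^{(t-1)}$.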
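The four-term expansion of $\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}$ can be checked numerically in the scalar case. In the sketch below, all constants are hypothetical: `W_HF`, `W_HI`, `W_HC` stand for $\mathcal W_{\mathcal H \Rightarrow \mathcal F}$, $\mathcal W_{\mathcal H \Rightarrow \mathcal I}$, $\mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal C}}$, and `A_F`, `A_I`, `A_C` fold each gate's input term and bias into a single number. The previous output gate $\mathcal O^{(t-1)}$ is held fixed, since it depends on $h^{(t-2)}$ rather than on $\mathcal C^{(t-1)}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W_HF, W_HI, W_HC = -0.4, 0.2, -0.5   # hypothetical recurrent gate weights
A_F, A_I, A_C = 0.34, 0.3, 0.56      # hypothetical W_x * x + b for each gate

def gates(c_prev, o_prev):
    """Forget gate, input gate, and candidate state at time t."""
    h_prev = o_prev * math.tanh(c_prev)
    f = sigmoid(A_F + W_HF * h_prev)
    i = sigmoid(A_I + W_HI * h_prev)
    g = math.tanh(A_C + W_HC * h_prev)
    return f, i, g

def next_cell(c_prev, o_prev):
    """C^(t) = C^(t-1) * f^(t) + C~^(t) * i^(t)."""
    f, i, g = gates(c_prev, o_prev)
    return c_prev * f + g * i

def cell_gradient_terms(c_prev, o_prev):
    """The four summands of dC^(t)/dC^(t-1)."""
    f, i, g = gates(c_prev, o_prev)
    dh_dc = o_prev * (1 - math.tanh(c_prev) ** 2)  # O^(t-1) * Tanh'(C^(t-1))
    return (
        f,                                          # direct term
        c_prev * f * (1 - f) * W_HF * dh_dc,        # through the forget gate
        g * i * (1 - i) * W_HI * dh_dc,             # through the input gate
        i * (1 - g ** 2) * W_HC * dh_dc,            # through the candidate state
    )

c_prev, o_prev = 0.5, 0.6
terms = cell_gradient_terms(c_prev, o_prev)
eps = 1e-5
numeric = (next_cell(c_prev + eps, o_prev)
           - next_cell(c_prev - eps, o_prev)) / (2 * eps)
print("terms:", [round(v, 6) for v in terms])
print(f"sum = {sum(terms):.8f}, finite difference = {numeric:.8f}")
```

The direct term is the forget gate itself, while the other three depend on the recurrent gate weights, which is what allows training to adjust the overall factor.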

This can be understood as follows:

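Throughout back-propagation, the gate weights at every time step take part in regulating $\begin{aligned}\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}\end{aligned}$. Compared with the recurrent neural network, where this role falls to a single weight matrix, this is considerably more robust;
Moreover, in the recurrent neural network the weight matrices enter purely as a product, whereas in $\text{LSTM}$ the terms are summed: even if the gradient through one gate vanishes at some time step, the remaining gates can adjust to sustain the gradient at that step.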