Deep Learning Notes on Recurrent Neural Networks: The Backward Pass of GRU
Introduction
The previous post introduced the Gated Recurrent Unit ($\text{GRU}$). In this post, following the same format as the $\text{LSTM}$ backward pass, we walk through the backward pass of the $\text{GRU}$.
Review: The Forward Pass of GRU
The forward computation of the $\text{GRU}$ is written as follows.
To prepare for the backward pass, the computation is broken into finer steps: $\widetilde{\mathcal Z}^{(t)}$ and $\widetilde{r}^{(t)}$ denote the linear (pre-activation) parts of the update gate and the reset gate, respectively.
$$\begin{cases} \begin{aligned} & \widetilde{\mathcal Z}^{(t)} = \mathcal W_{\mathcal H \Rightarrow \mathcal Z} \cdot h^{(t-1)} + \mathcal W_{\mathcal X \Rightarrow \mathcal Z} \cdot x^{(t)} + b_{\mathcal Z} \\ & \mathcal Z^{(t)} = \sigma(\widetilde{\mathcal Z}^{(t)}) \\ & \widetilde{r}^{(t)} = \mathcal W_{\mathcal H \Rightarrow r} \cdot h^{(t-1)} + \mathcal W_{\mathcal X \Rightarrow r} \cdot x^{(t)} + b_{r} \\ & r^{(t)} = \sigma(\widetilde{r}^{(t)}) \\ & \widetilde{h}^{(t)} = \text{Tanh} \left[\mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal H}} \cdot (r^{(t)} * h^{(t-1)}) + \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal H}} \cdot x^{(t)} + b_{\widetilde{\mathcal H}}\right] \\ & h^{(t)} = (1 -\mathcal Z^{(t)}) * h^{(t-1)} + \mathcal Z^{(t)} * \widetilde{h}^{(t)} \end{aligned} \end{cases}$$
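For concreteness, here is a minimal NumPy sketch of one forward step of these equations. The parameter names (`W_hz` standing in for $\mathcal W_{\mathcal H \Rightarrow \mathcal Z}$, `W_xh` for $\mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal H}}$, and so on) and the vector shapes are conventions of this sketch, not notation from the derivation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward_step(h_prev, x, params):
    """One GRU forward step following the equations above.

    h_prev : hidden state h^(t-1), shape (n_h,)
    x      : input x^(t), shape (n_x,)
    params : dict of weights; key names are this sketch's own convention.
    """
    z_tilde = params["W_hz"] @ h_prev + params["W_xz"] @ x + params["b_z"]  # update-gate pre-activation
    z = sigmoid(z_tilde)                                                    # update gate Z^(t)
    r_tilde = params["W_hr"] @ h_prev + params["W_xr"] @ x + params["b_r"]  # reset-gate pre-activation
    r = sigmoid(r_tilde)                                                    # reset gate r^(t)
    h_tilde = np.tanh(params["W_hh"] @ (r * h_prev)                         # candidate state h~^(t)
                      + params["W_xh"] @ x + params["b_h"])
    h = (1.0 - z) * h_prev + z * h_tilde                                    # new hidden state h^(t)
    # cache all intermediates: the backward-pass sketches later reuse them
    return h, (z_tilde, z, r_tilde, r, h_tilde, h_prev, x)
```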
Scenario Setup
The equations above only describe how the $\text{GRU}$ iterates the sequence information $h^{(t)} \; (t=1,2,\cdots,\mathcal T)$. The per-step output features and the loss function are the same as in a vanilla recurrent neural network:
- A $\text{Softmax}$ activation is used, and its output serves as the model's prediction at time $t$:
$$\begin{cases} \mathcal C^{(t)} = \mathcal W_{\mathcal H \Rightarrow \mathcal C} \cdot h^{(t)} + b_{h} \\ \hat y^{(t)} = \text{Softmax}(\mathcal C^{(t)}) \end{cases}$$
- The discrepancy between the prediction $\hat y^{(t)}$ and the true distribution $y^{(t)}$ at time $t$ is measured with the cross-entropy ($\text{CrossEntropy}$), where $n_{\mathcal Y}$ denotes the dimensionality of the predicted/true distribution:
$$\mathcal L^{(t)} = \mathcal L \left[\hat y^{(t)},y^{(t)}\right] = - \sum_{j=1}^{n_\mathcal Y} y_j^{(t)} \log \left[\hat y_j^{(t)}\right]$$
- The complete loss function $\mathcal L$ is the accumulated sum of the cross-entropy terms over all time steps:
$$\begin{aligned} \mathcal L & = \sum_{t=1}^{\mathcal T} \mathcal L^{(t)} \\ & = \sum_{t=1}^{\mathcal T} \mathcal L \left[\hat y^{(t)},y^{(t)}\right] \end{aligned}$$
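Continuing the sketch above, the per-step output and cross-entropy loss can be written as follows (again, the helper names and the dense output layer are illustrative assumptions):

```python
def softmax(c):
    e = np.exp(c - c.max())            # subtract the max for numerical stability
    return e / e.sum()

def output_and_loss(h, y, W_hc, b_c):
    """Per-step prediction ŷ^(t) and cross-entropy loss L^(t)."""
    c = W_hc @ h + b_c                 # C^(t) = W_{H⇒C} · h^(t) + b
    y_hat = softmax(c)                 # ŷ^(t) = Softmax(C^(t))
    loss = -np.sum(y * np.log(y_hat))  # L^(t) = −Σ_j y_j log ŷ_j
    return y_hat, loss
```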
The Backward Pass
Backward pass at time $\mathcal T$
Take the backward pass of the update-gate weight gradient $\frac{\partial \mathcal L}{\partial \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}}$ at time $\mathcal T$ as an example:
- Compute the gradient $\frac{\partial \mathcal L}{\partial \mathcal L^{(\mathcal T)}}$:
  Only the $\mathcal L^{(\mathcal T)}$ term carries a gradient; every other term is treated as a constant.
$$\frac{\partial \mathcal L}{\partial \mathcal L^{(\mathcal T)}} = \frac{\partial}{\partial \mathcal L^{(\mathcal T)}} \left[\sum_{t=1}^{\mathcal T} \mathcal L^{(t)}\right] = 0 + 0 + \cdots + 1 = 1$$
- Compute the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}}$:
  For the gradient of the $\text{Softmax}$ activation combined with cross-entropy, see the earlier post on the backward pass of the $\text{Softmax}$ function in recurrent neural networks; the derivation is not repeated here.
$$\begin{aligned} & \begin{cases} \mathcal L^{(\mathcal T)} = -\sum_{j=1}^{n_{\mathcal Y}}y_j^{(\mathcal T)} \log \left[\hat y_j^{(\mathcal T)}\right] \\ \hat y^{(\mathcal T)} = \text{Softmax}[\mathcal C^{(\mathcal T)}] \end{cases} \\ & \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} = \hat y^{(\mathcal T)} - y^{(\mathcal T)} \end{aligned}$$
- Next, compute the gradient $\frac{\partial \mathcal C^{(\mathcal T)}}{\partial h^{(\mathcal T)}}$:
$$\frac{\partial \mathcal C^{(\mathcal T)}}{\partial h^{(\mathcal T)}} = \frac{\partial}{\partial h^{(\mathcal T)}} \left[\mathcal W_{\mathcal H \Rightarrow \mathcal C} \cdot h^{(\mathcal T)} + b_{h}\right] = \mathcal W_{\mathcal H \Rightarrow \mathcal C}$$
At this point, the gradient $\frac{\partial \mathcal L}{\partial h^{(\mathcal T)}}$ can be written as:
$$\begin{aligned} \frac{\partial \mathcal L}{\partial h^{(\mathcal T)}} & = \frac{\partial \mathcal L}{\partial \mathcal L^{(\mathcal T)}} \cdot \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \\ & = 1 \cdot \left[\mathcal W_{\mathcal H \Rightarrow \mathcal C}\right]^T \cdot (\hat y^{(\mathcal T)} - y^{(\mathcal T)}) \end{aligned}$$
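In code, this gradient is a single matrix-vector product. A minimal sketch, reusing the illustrative names from above:

```python
def dloss_dh_last(y_hat, y, W_hc):
    """∂L/∂h^(T) = [W_{H⇒C}]^T · (ŷ^(T) − y^(T))."""
    dc = y_hat - y      # ∂L^(T)/∂C^(T): the Softmax/cross-entropy gradient
    return W_hc.T @ dc  # chain through C^(T) = W_{H⇒C} · h^(T) + b
```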
Observation: starting from $h^{(\mathcal T)}$, which propagation paths lead from $h^{(\mathcal T)} \Rightarrow \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}$?
There is exactly one, and its forward computation path is:
$$\begin{cases} \begin{aligned} & h^{(t)} = (1 -\mathcal Z^{(t)}) * h^{(t-1)} + \mathcal Z^{(t)} * \widetilde{h}^{(t)} \\ & \mathcal Z^{(t)} = \sigma(\widetilde{\mathcal Z}^{(t)}) \\ & \widetilde{\mathcal Z}^{(t)} = \mathcal W_{\mathcal H \Rightarrow \mathcal Z} \cdot h^{(t-1)} + \mathcal W_{\mathcal X \Rightarrow \mathcal Z} \cdot x^{(t)} + b_{\mathcal Z} \end{aligned} \end{cases}$$
The corresponding backward result is:
$$\begin{aligned} \frac{\partial h^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}} & = \frac{\partial h^{(\mathcal T)}}{\partial \mathcal Z^{(\mathcal T)}} \cdot \frac{\partial \mathcal Z^{(\mathcal T)}}{\partial \widetilde{\mathcal Z}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal Z}^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}} \\ & = \left[\widetilde{h}^{(\mathcal T)} - h^{(\mathcal T - 1)}\right] \cdot \left[\text{Sigmoid}(\widetilde{\mathcal Z}^{(\mathcal T)})\right]' \cdot h^{(\mathcal T - 1)} \end{aligned}$$
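The scalar-style chain above glosses over shapes. Treating $h$ as a vector, the first two factors act elementwise while the trailing $h^{(\mathcal T - 1)}$ enters as an outer product; a sketch under those assumptions, with the upstream gradient $\frac{\partial \mathcal L}{\partial h^{(\mathcal T)}}$ folded in (which is how the chain is implemented in practice):

```python
def dloss_dWhz_last(dh, cache):
    """∂L/∂W_{H⇒Z} along the single h^(T) ⇒ W path.

    dh    : incoming gradient ∂L/∂h^(T), shape (n_h,)
    cache : intermediates returned by gru_forward_step
    The elementwise/outer-product shape handling is this sketch's assumption.
    """
    z_tilde, z, r_tilde, r, h_tilde, h_prev, x = cache
    dz = dh * (h_tilde - h_prev)       # ∂h^(T)/∂Z^(T) = h~^(T) − h^(T−1), elementwise
    dz_tilde = dz * z * (1.0 - z)      # σ'(Z~^(T)) = σ(Z~)(1 − σ(Z~)) = z(1 − z)
    return np.outer(dz_tilde, h_prev)  # the trailing h^(T−1) enters as an outer product
```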
Finally, the backward result for $\frac{\partial \mathcal L}{\partial \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}}$ is shown below. The point here is mainly to describe the propagation path; this expanded form will not be spelled out again later.
$$\begin{aligned} \frac{\partial \mathcal L}{\partial \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal Z^{(\mathcal T)}}} & = \frac{\partial \mathcal L}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T)}\Rightarrow \mathcal Z^{(\mathcal T)}}} \\ & = \left\{\left[\mathcal W_{\mathcal H \Rightarrow \mathcal C}\right]^T \cdot (\hat y^{(\mathcal T)} - y^{(\mathcal T)})\right\} \cdot \left\{ \left[\widetilde{h}^{(\mathcal T)} - h^{(\mathcal T - 1)}\right] \cdot \left[\text{Sigmoid}(\widetilde{\mathcal Z}^{(\mathcal T)})\right]' \cdot h^{(\mathcal T - 1)}\right\} \end{aligned}$$
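A quick way to gain confidence in this chain is a finite-difference check on a tiny random instance. A sketch reusing the helpers above (all sizes and names are illustrative; with a single time step, this path is the only one, so the two numbers should agree):

```python
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
shapes = {"W_hz": (n_h, n_h), "W_xz": (n_h, n_x), "b_z": (n_h,),
          "W_hr": (n_h, n_h), "W_xr": (n_h, n_x), "b_r": (n_h,),
          "W_hh": (n_h, n_h), "W_xh": (n_h, n_x), "b_h": (n_h,)}
params = {k: rng.standard_normal(s) for k, s in shapes.items()}
W_hc, b_c = rng.standard_normal((n_h, n_h)), rng.standard_normal(n_h)
h_prev, x, y = rng.standard_normal(n_h), rng.standard_normal(n_x), np.eye(n_h)[0]

def total_loss():
    h, _ = gru_forward_step(h_prev, x, params)
    return output_and_loss(h, y, W_hc, b_c)[1]

# analytic gradient along the h^(T) ⇒ W_{H⇒Z} path
h, cache = gru_forward_step(h_prev, x, params)
y_hat, _ = output_and_loss(h, y, W_hc, b_c)
analytic = dloss_dWhz_last(dloss_dh_last(y_hat, y, W_hc), cache)

# finite-difference estimate for one weight entry
i, j, eps = 1, 2, 1e-6
base = total_loss()
params["W_hz"][i, j] += eps
numeric = (total_loss() - base) / eps
params["W_hz"][i, j] -= eps
print(analytic[i, j], numeric)  # the two values should agree to roughly 1e-5
```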
Backward-pass paths at time $\mathcal T - 1$
For the update-gate gradient at time $\mathcal T - 1$, $\frac{\partial \mathcal L}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}}$, the paths fall into two major classes:
First class: the same kind of path as at time $\mathcal T$, propagating directly from the corresponding $\mathcal L^{(\mathcal T - 1)}$ to $\mathcal W_{h^{(\mathcal T - 1)}\Rightarrow \mathcal Z^{(\mathcal T - 1)}}$:
This path has the same form as the time-$\mathcal T$ path above; simply replace the superscript $\mathcal T$ with $\mathcal T - 1$.
$$\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} & = \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \end{aligned}$$
Second class: observe that $\mathcal W_{h^{(\mathcal T - 1)}\Rightarrow \mathcal Z^{(\mathcal T - 1)}}$ appears only inside $\mathcal Z^{(\mathcal T - 1)}$, and $\mathcal Z^{(\mathcal T - 1)}$ appears only inside $h^{(\mathcal T - 1)}$. It therefore suffices to find all paths that reach $h^{(\mathcal T - 1)}$; each of them ultimately passes its gradient on to $\mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}$ via $\frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}}$.
The "first class" above is one such case, except that its gradient arrives directly from the current time step $\mathcal T - 1$. In the second class we focus on the gradient information propagated from time $\mathcal T$.
From $\mathcal T \Rightarrow \mathcal T - 1$, there are $4$ gradient paths that reach $h^{(\mathcal T - 1)}$:
- Path 1: propagated through the $h^{(\mathcal T - 1)}$ term inside $h^{(\mathcal T)}$ at time $\mathcal T$.
$$\begin{cases} \text{Forward : } h^{(\mathcal T)} = (1 -\mathcal Z^{(\mathcal T)}) * h^{(\mathcal T -1)} + \mathcal Z^{(\mathcal T)} * \widetilde{h}^{(\mathcal T)} \\ \quad \\ \text{Backward : }\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \end{aligned} \end{cases}$$
- Path 2: propagated through the $\mathcal Z^{(\mathcal T)}$ term inside $h^{(\mathcal T)}$ and on to $h^{(\mathcal T - 1)}$.
$$\begin{cases} \text{Forward : } \begin{cases} h^{(\mathcal T)} = (1 -\mathcal Z^{(\mathcal T)}) * h^{(\mathcal T -1)} + \mathcal Z^{(\mathcal T)} * \widetilde{h}^{(\mathcal T)} \\ \mathcal Z^{(\mathcal T)} = \sigma \left[\mathcal W_{\mathcal H \Rightarrow \mathcal Z} \cdot h^{(\mathcal T -1)} + \mathcal W_{\mathcal X \Rightarrow \mathcal Z} \cdot x^{(\mathcal T)} + b_{\mathcal Z}\right] \end{cases} \quad \\ \text{Backward : } \begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial \mathcal Z^{(\mathcal T)}} \cdot \frac{\partial \mathcal Z^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \end{aligned} \end{cases}$$
- Path 3: propagated through the $\widetilde{h}^{(\mathcal T)}$ term inside $h^{(\mathcal T)}$ and on to $h^{(\mathcal T - 1)}$.
$$\begin{cases} \text{Forward : } \begin{cases} h^{(\mathcal T)} = (1 -\mathcal Z^{(\mathcal T)}) * h^{(\mathcal T -1)} + \mathcal Z^{(\mathcal T)} * \widetilde{h}^{(\mathcal T)} \\ \widetilde{h}^{(\mathcal T)} = \text{Tanh} \left[\mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal H}} \cdot (r^{(\mathcal T)} * h^{(\mathcal T -1)}) + \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal H}} \cdot x^{(\mathcal T)} + b_{\widetilde{\mathcal H}}\right] \end{cases}\\ \quad \\ \text{Backward : } \begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial \widetilde{h}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{h}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \end{aligned} \end{cases}$$
- Path 4: similar to path 3, except that it passes through the $r^{(\mathcal T)}$ term inside $\widetilde{h}^{(\mathcal T)}$ and on to $h^{(\mathcal T - 1)}$.
$$\begin{cases} \text{Forward : } \begin{cases} h^{(\mathcal T)} = (1 -\mathcal Z^{(\mathcal T)}) * h^{(\mathcal T -1)} + \mathcal Z^{(\mathcal T)} * \widetilde{h}^{(\mathcal T)} \\ \widetilde{h}^{(\mathcal T)} = \text{Tanh} \left[\mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal H}} \cdot (r^{(\mathcal T)} * h^{(\mathcal T -1)}) + \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal H}} \cdot x^{(\mathcal T)} + b_{\widetilde{\mathcal H}}\right] \\ r^{(\mathcal T)} = \sigma(\widetilde{r}^{(\mathcal T)}) \\ \widetilde{r}^{(\mathcal T)} = \mathcal W_{\mathcal H \Rightarrow r} \cdot h^{(\mathcal T -1)} + \mathcal W_{\mathcal X \Rightarrow r} \cdot x^{(\mathcal T)} + b_{r} \end{cases}\\ \quad \\ \text{Backward : } \begin{aligned} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \Rightarrow \frac{\partial \mathcal L^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial \widetilde{h}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{h}^{(\mathcal T)}}{\partial r^{(\mathcal T)}} \cdot \frac{\partial r^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \mathcal Z^{(\mathcal T - 1)}}} \end{aligned} \end{cases}$$
At this point, all $5$ paths contributing to the time-$(\mathcal T - 1)$ gradient have been found. Among them:
- $1$ path is time $\mathcal T - 1$'s own direct path;
- the remaining $4$ are $\mathcal T \Rightarrow \mathcal T - 1$ propagation paths (a code sketch accumulating these four contributions follows below).
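To make the bookkeeping concrete, here is a minimal sketch that accumulates $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}}$ over the four paths just listed, reusing the cache and the hypothetical parameter names from the forward sketch:

```python
def dh_prev_from_T(dh, cache, params):
    """Accumulate ∂L^(T)/∂h^(T−1) over the four T ⇒ T−1 paths above."""
    z_tilde, z, r_tilde, r, h_tilde, h_prev, x = cache
    da = (dh * z) * (1.0 - h_tilde ** 2)  # gradient at the Tanh pre-activation (paths 3 and 4)
    d_rh = params["W_hh"].T @ da          # gradient w.r.t. the product r^(T) * h^(T−1)

    path1 = dh * (1.0 - z)                                                # h^(T) → h^(T−1) directly
    path2 = params["W_hz"].T @ (dh * (h_tilde - h_prev) * z * (1.0 - z))  # via Z^(T)
    path3 = d_rh * r                                                      # via h~^(T)
    path4 = params["W_hr"].T @ (d_rh * h_prev * r * (1.0 - r))            # via h~^(T) → r^(T)
    return path1 + path2 + path3 + path4
```

In a full BPTT implementation, this returned value is added to the local gradient coming from $\mathcal L^{(\mathcal T - 1)}$, and the sum is then pushed into the step-$(\mathcal T - 1)$ weight gradients, e.g. via `dloss_dWhz_last`.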
Backward-pass paths at time $\mathcal T - 2$
Going one step further back, a pattern emerges in the number of propagation paths:
- The $1$st path is still the gradient passed directly from $\mathcal L^{(\mathcal T - 2)}$ to $\mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}$:
$$\begin{aligned} \frac{\partial \mathcal L^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} = \frac{\partial \mathcal L^{(\mathcal T - 2)}}{\partial h^{(\mathcal T - 2)}} \cdot \frac{\partial h^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} \end{aligned}$$
- There are $4$ paths that start from $\mathcal L^{(\mathcal T - 1)}$ and propagate from $\mathcal T - 1 \Rightarrow \mathcal T - 2$:
$$\frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} = \begin{cases} \begin{aligned} & \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 2)}} \cdot \frac{\partial h^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} \\ & \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \mathcal Z^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal Z^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 2)}} \cdot \frac{\partial h^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}}\\ & \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \widetilde{h}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{h}^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 2)}} \cdot \frac{\partial h^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} \\ & \frac{\partial \mathcal L^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T - 1)}}{\partial \widetilde{h}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{h}^{(\mathcal T - 1)}}{\partial r^{(\mathcal T - 1)}} \cdot \frac{\partial r^{(\mathcal T - 1)}}{\partial h^{(\mathcal T - 2)}} \cdot \frac{\partial h^{(\mathcal T - 2)}}{\partial \mathcal W_{h^{(\mathcal T - 2)} \Rightarrow \mathcal Z^{(\mathcal T - 2)}}} \end{aligned} \end{cases}$$
- There are $4 \times 4$ paths that start from $\mathcal L^{(\mathcal T)}$ and propagate from $\mathcal T \Rightarrow \mathcal T - 2$.
These are not written out here; it would get far too long-winded.
And this is only the path count for $\mathcal T \Rightarrow \mathcal T - 2$; at time $\mathcal T - 3$ there are already $4 \times 4 \times 4 = 64$ gradient paths associated with $\mathcal L^{(\mathcal T)}$, and so on.
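The growth pattern is easy to tabulate; a tiny sketch of the $4^k$ count implied by the argument above:

```python
# Paths from L^(T) back to the weights at time T − k: each extra step back
# multiplies the count by the 4 paths through h^(t) ⇒ h^(t−1).
for k in range(4):
    print(f"paths from L^(T) to time T-{k}: {4 ** k}")  # 1, 4, 16, 64
```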
Summary
Compared with the backward-pass paths of $\text{LSTM}$: counting only the paths propagated from $\mathcal T \Rightarrow \mathcal T - 2$, $\text{LSTM}$ already has $24$ while $\text{GRU}$ has only $16$. This greatly reduces the number of backward-propagation paths, and with it the time and space complexity.
In addition, $\text{GRU}$ has fewer model parameters to update than $\text{LSTM}$, which lowers the risk of overfitting ($\text{OverFitting}$).
Its way of suppressing vanishing gradients follows the same idea as $\text{LSTM}$. As the backward pass reaches deeper into the sequence, the number of relevant gradient paths still grows exponentially, but at a clearly smaller scale than in $\text{LSTM}$; and the update gate and reset gate themselves take part in the gradient computation, regulating how much each gradient component contributes.
Related reference:
GRU循环神经网络——LSTM的轻量级版本 (GRU Recurrent Neural Networks: a lightweight version of LSTM)