Introduction

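The previous section introduced the basic logic of the forward computation in a recurrent neural network, together with perplexity, the metric used to judge the quality of a language model. This section covers the backward propagation (BP) of the $\text{Softmax}$ function.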

Recap: forward computation of a recurrent neural network

Setup

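A recurrent neural network cell at a particular time step is shown below:
[Figure: a recurrent neural network cell at time step $t$]
where: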

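$x_t$ denotes the input at time step $t$, with shape $x_t \in \mathbb R^{n_x \times m \times 1}$. Here $n_x$ is the dimension of the input vector at the current step, $m$ is the number of samples, and the trailing $1$ marks the single time step $t$.
- The input vector may be a word vector or any other vector describing one unit of the sequence; $n_x$ is the size of that vector.
- $m$ can be taken as the number of samples in the current $\text{Batch}$.
- The complete input sequence $\mathcal X$ is written as follows, where $\mathcal T$ is the number of input time steps.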
  $\mathcal X = (x_1,x_2,\cdots,x_t,x_{t+1},\cdots,x_{\mathcal T})^T \in \mathbb R^{n_x \times m \times \mathcal T}$
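$h_t$ denotes the sequence information (hidden state) at time step $t$, which is also the value passed on to step $t + 1$. Its shape is:
$h_t \in \mathbb R^{n_h \times m \times 1}$
Here $n_h$ is the dimension of the hidden state, which is determined by the parameters $\mathcal W_{\mathcal H \Rightarrow \mathcal H},\mathcal W_{\mathcal X \Rightarrow \mathcal H}$. The same holds for $h_{t+1} \in \mathbb R^{n_h \times m \times 1}$.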
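The corresponding hidden-state tensor is $\mathcal H \in \mathbb R^{n_h \times m \times \mathcal T}$. Each new input produces one more hidden state, which summarizes a correspondingly longer prefix of the sequence, so $\mathcal X$ and $\mathcal H$ share the same $\mathcal T$.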
$\mathcal O_{t+1}$ denotes the prediction computed after the data is fed in. Its shape is:
$\mathcal O_{t+1} \in \mathbb R^{n_{\mathcal O} \times m \times 1}$
where $n_{\mathcal O}$ is the length of the predicted output.
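Likewise, the corresponding output tensor is $\mathcal O \in \mathbb R^{n_{\mathcal O} \times m \times \mathcal T_{\mathcal O}}$, where $\mathcal T_{\mathcal O}$ is the number of output time steps. Note that $\mathcal T_{\mathcal O}$ and $\mathcal T$ are two separate quantities: the length of the output sequence need not match the input length, and it depends on the time steps at which the output weights $\mathcal W_{\mathcal H \Rightarrow \mathcal O}$ are applied.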

Forward computation

For convenience, the time index is moved from a subscript to a superscript:
$x_t,h_t,h_{t+1},\mathcal O_{t+1} \Rightarrow x^{(t)},h^{(t)},h^{(t+1)},\mathcal O^{(t+1)}$

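The forward computation of the cell at time step $t$ is described below.
Note that $h^{(t+1)},\mathcal O^{(t+1)}$ are predictions about the next time step, and that this prediction is carried out at step $t$.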

Computing the sequence information $h^{(t+1)}$:
$\begin{cases} \mathcal Z_1^{(t+1)} = \mathcal W_{h^{(t)} \Rightarrow h^{(t+1)}}\cdot h^{(t)} + \mathcal W_{x^{(t)} \Rightarrow h^{(t+1)}} \cdot x^{(t)} + b_{h^{(t+1)}} \\ \quad \\ h^{(t+1)} = \text{Tanh}(\mathcal Z_1^{(t+1)}) \end{cases}$
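Computing the prediction $\mathcal O^{(t+1)}$:
Evaluating the posterior $\mathcal P_{model}[\mathcal O^{(t+1)} \mid x^{(t)},h^{(t+1)}]$ is essentially a classification task: the outcome with the highest probability under this distribution is taken as the prediction for $x^{(t+1)}$. The $\text{Softmax}$ function is used to produce the probability assigned to each outcome.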
$\begin{cases} \mathcal Z_2^{(t+1)} = \mathcal W_{h^{(t+1)} \Rightarrow \mathcal O^{(t+1)}} \cdot h^{(t+1)} + b_{\mathcal O^{(t+1)}} \\ \quad \\ \begin{aligned} \mathcal O^{(t+1)} & = \text{Softmax}(\mathcal Z_2^{(t+1)}) \\ & = \frac{\exp \left\{\mathcal Z_2^{(t+1)}\right\}}{\sum_{i=1}^{n_{\mathcal O}}\exp \left\{\mathcal Z_{2;i}^{(t+1)}\right\}} \\ \end{aligned} \end{cases}$

The shapes of the parameters appearing in these formulas are:
$\begin{aligned} & \mathcal Z_1:\begin{cases} \mathcal W_{h^{(t)} \Rightarrow h^{(t+1)}} \in \mathbb R^{1 \times n_h} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal H} \in \mathbb R^{n_h \times n_h} \\ \mathcal W_{x^{(t)} \Rightarrow h^{(t+1)}} \in \mathbb R^{1 \times n_x} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal H} \in \mathbb R^{n_h \times n_x} \\ b_{h^{(t+1)}} \in \mathbb R^{1 \times 1} \Rightarrow b_{\mathcal H} \in \mathbb R^{n_h \times 1} \end{cases} \\ & \mathcal Z_2:\begin{cases} \mathcal W_{h^{(t+1)} \Rightarrow \mathcal O^{(t+1)}} \in \mathbb R^{1 \times n_h} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal O} \in \mathbb R^{n_{\mathcal O} \times n_h} \\ b_{\mathcal O^{(t+1)}} \in \mathbb R^{1 \times 1} \Rightarrow b_{\mathcal O} \in \mathbb R^{n_{\mathcal O} \times 1} \end{cases} \end{aligned}$
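The forward step and the parameter shapes above can be sketched in NumPy. This is only an illustrative sketch: the function names `softmax` and `rnn_step`, the random initialization, and the choice to drop the trailing time axis (so that $x^{(t)}$ is stored as an $n_x \times m$ array) are assumptions made for the example, not part of the original text.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (n_o, m)."""
    z = z - z.max(axis=0, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(x_t, h_t, W_hh, W_xh, b_h, W_ho, b_o):
    """One forward step: returns h^(t+1) and O^(t+1)."""
    z1 = W_hh @ h_t + W_xh @ x_t + b_h  # Z_1^(t+1), shape (n_h, m)
    h_next = np.tanh(z1)                # h^(t+1),   shape (n_h, m)
    z2 = W_ho @ h_next + b_o            # Z_2^(t+1), shape (n_o, m)
    o_next = softmax(z2)                # O^(t+1),   shape (n_o, m)
    return h_next, o_next

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
n_x, n_h, n_o, m = 4, 3, 5, 2
x_t = rng.normal(size=(n_x, m))
h_t = np.zeros((n_h, m))
W_hh = rng.normal(size=(n_h, n_h))
W_xh = rng.normal(size=(n_h, n_x))
b_h = np.zeros((n_h, 1))
W_ho = rng.normal(size=(n_o, n_h))
b_o = np.zeros((n_o, 1))

h_next, o_next = rnn_step(x_t, h_t, W_hh, W_xh, b_h, W_ho, b_o)
print(h_next.shape, o_next.shape)
```

Each column of `o_next` is a probability distribution over the $n_{\mathcal O}$ outcomes for one sample in the batch.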

Groundwork: backpropagation through $\text{Softmax}$

Setup

Suppose an $\mathcal L$-layer fully connected neural network is used for a $\mathcal C$-class classification task, with a training set $\mathcal D$ made up of $m$ training samples:
$\mathcal D = \{(x^{(i)},y^{(i)})\}_{i=1}^m$
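The intermediate computation is skipped and only the output is considered. Let $\hat y^{(i)}$ be the prediction for each $x^{(i)}(i=1,2,\cdots,m)$, and compute its loss with cross-entropy $(\text{CrossEntropy})$: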
$\mathscr L \left[y^{(i)},\hat y^{(i)}\right] = -\sum_{j=1}^{\mathcal C} y_j^{(i)} \log \hat y_j^{(i)}$
Accordingly, the cost function $\mathcal J(\mathcal W)$ over the training set $\mathcal D$ is:
$\mathcal J(\mathcal W) = \frac{1}{m} \sum_{i=1}^m \mathscr L \left[y^{(i)},\hat y^{(i)}\right]$
The bias term $b$ is omitted here.
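As a small illustration of the per-sample loss and the averaged cost, here is a NumPy sketch. The function names `cross_entropy` and `cost`, the row-per-sample layout of shape $(m, \mathcal C)$, and the toy values are assumptions made for this example.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Per-sample loss -sum_j y_j * log(y_hat_j); inputs have shape (m, C)."""
    return -np.sum(y * np.log(y_hat), axis=1)

def cost(y, y_hat):
    """Average of the per-sample losses over the m samples."""
    return cross_entropy(y, y_hat).mean()

# Two samples, three classes; labels are one-hot (toy values).
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

losses = cross_entropy(y, y_hat)
J = cost(y, y_hat)
print(losses, J)
```

With one-hot labels, each per-sample loss reduces to the negative log-probability assigned to the true class.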

The forward computation from the last layer's output $\mathcal Z^{(\mathcal L)}$ through the $\text{Softmax}$ activation is:
$\hat y = a^{(\mathcal L)} = \text{Softmax}(\mathcal Z^{(\mathcal L)})$

$\text{Softmax}$ backpropagation

Take a single sample $(x,y) \in \mathcal D$ as an example. First compute the derivative of its loss $\mathscr L(y,\hat y)$ with respect to the predicted output $\hat y = a^{(\mathcal L)}$:
$\begin{aligned} \frac{\partial \mathscr L}{\partial a^{(\mathcal L)}} & = \frac{\partial}{\partial a^{(\mathcal L)}} \left[-\sum_{j=1}^{\mathcal C} y_j \log \hat y_j \right] \\ & = \frac{\partial}{\partial a^{(\mathcal L)}} \left[- (y_1 \log \hat y_1 + y_2 \log \hat y_2 + \cdots + y_{\mathcal C} \log \hat y_{\mathcal C}) \right] \\ & = \frac{\partial}{\partial a^{(\mathcal L)}} \left[ - (y_1 \log a_1^{(\mathcal L)} + y_2 \log a_2^{(\mathcal L)} + \cdots + y_{\mathcal C} \log a_{\mathcal C}^{(\mathcal L)})\right] \end{aligned}$
Clearly, $\mathscr L$ is a sum over all components and is therefore a scalar, while $a^{(\mathcal L)}$ is a $1 \times \mathcal C$ vector. The derivative is taken component by component, and the fractions in the last two lines denote element-wise division:
$\begin{aligned} \frac{\partial \mathscr L}{\partial a^{(\mathcal L)}} & = \left[\frac{\partial \mathscr L}{\partial a_1^{(\mathcal L)}},\cdots,\frac{\partial \mathscr L}{\partial a_{\mathcal C}^{(\mathcal L)}}\right]\\ & = \left\{\frac{\partial}{\partial a_1^{(\mathcal L)}} \left[-(\underbrace{y_1 \log a_1^{(\mathcal L)}}_{\text{depends on } a_1^{(\mathcal L)}} + \underbrace{\cdots + y_{\mathcal C} \log a_{\mathcal C}^{(\mathcal L)}}_{\text{independent of } a_1^{(\mathcal L)}})\right],\cdots,\frac{\partial}{\partial a_{\mathcal C}^{(\mathcal L)}} \left[-(\underbrace{y_1 \log a_1^{(\mathcal L)} + \cdots}_{\text{independent of } a_{\mathcal C}^{(\mathcal L)}} + \underbrace{y_{\mathcal C} \log a_{\mathcal C}^{(\mathcal L)}}_{\text{depends on } a_{\mathcal C}^{(\mathcal L)}})\right]\right\} \\ & = \left[-\frac{y_1}{a_1^{(\mathcal L)}},\cdots,-\frac{y_{\mathcal C}}{a_{\mathcal C}^{(\mathcal L)}}\right] \\ & = -\frac{(y_1,\cdots,y_{\mathcal C})}{\left(a_1^{(\mathcal L)},\cdots,a_{\mathcal C}^{(\mathcal L)} \right)} \\ & = -\frac{y}{\hat y} \end{aligned}$
For the derivative of a scalar with respect to a vector, see the link at the end of the article.
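The result $-y / \hat y$ can be sanity-checked numerically. The sketch below compares it with a central finite difference; the helper name `loss`, the toy vectors, and the step size `eps` are assumptions for illustration. Note that $a^{(\mathcal L)}$ is treated here as a free vector, which matches the partial derivative being taken at this step.

```python
import numpy as np

def loss(y, a):
    """Cross-entropy of a single sample: -sum_j y_j * log(a_j)."""
    return -np.sum(y * np.log(a))

y = np.array([0.0, 1.0, 0.0])   # one-hot label (toy value)
a = np.array([0.2, 0.5, 0.3])   # predicted probabilities (toy value)

analytic = -y / a               # element-wise division, as derived above

# Central finite difference, one component of a at a time.
eps = 1e-6
numeric = np.zeros_like(a)
for k in range(a.size):
    step = np.zeros_like(a)
    step[k] = eps
    numeric[k] = (loss(y, a + step) - loss(y, a - step)) / (2 * eps)

print(analytic)
print(numeric)
```

Only the component belonging to the true class is non-zero, because every other $y_j$ is $0$.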
Continuing the backward pass, compute $\begin{aligned}\frac{\partial \mathscr L}{\partial \mathcal Z^{(\mathcal L)}}\end{aligned}$:
$\frac{\partial \mathscr L}{\partial \mathcal Z^{(\mathcal L)}} = \frac{\partial \mathscr L}{\partial a^{(\mathcal L)}} \cdot \frac{\partial a^{(\mathcal L)}}{\partial \mathcal Z^{(\mathcal L)}}$
As for $\begin{aligned}\frac{\partial a^{(\mathcal L)}}{\partial \mathcal Z^{(\mathcal L)}}\end{aligned}$, both $a^{(\mathcal L)}$ and $\mathcal Z^{(\mathcal L)}$ are $1 \times \mathcal C$ vectors, so the derivative is a $\mathcal C \times \mathcal C \times 1$ three-dimensional tensor:
$\begin{aligned} \frac{\partial a^{(\mathcal L)}}{\partial \mathcal Z^{(\mathcal L)}} & = \left[\frac{\partial a^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}},\cdots,\frac{\partial a^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\right]_{\mathcal C \times \mathcal C \times 1}^T \\ & = \left\{\frac{\partial}{\partial z_1^{(\mathcal L)}}\left[\frac{\exp(\mathcal Z^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right],\cdots,\frac{\partial}{\partial z_{\mathcal C}^{(\mathcal L)}} \left[\frac{\exp(\mathcal Z^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right]\right\}_{\mathcal C \times \mathcal C \times 1}^T \end{aligned}$
Take the first entry as an example. It is a $1 \times \mathcal C$ vector, while $z_1^{(\mathcal L)}$ is a scalar. Here $\begin{aligned}\frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\end{aligned}$ is the first component of the output $a^{(\mathcal L)}$, denoted $a_1^{(\mathcal L)}$, and the remaining components are defined likewise. The derivative expands as:
$\frac{\partial}{\partial z_1^{(\mathcal L)}}\left[\frac{\exp(\mathcal Z^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right] = \left\{\frac{\partial}{\partial z_1^{(\mathcal L)}}\underbrace{\left[\frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right]}_{a_1^{(\mathcal L)}},\cdots,\frac{\partial}{\partial z_1^{(\mathcal L)}}\underbrace{\left[\frac{\exp(z_{\mathcal C}^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right]}_{a_{\mathcal C}^{(\mathcal L)}}\right\}_{1 \times \mathcal C}$
Again taking the first entry as an example, $\begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned}$ follows from the quotient rule. Within $\left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]'$, only the first term depends on $z_1^{(\mathcal L)}$, so this derivative equals $\exp(z_1^{(\mathcal L)})$:
$\begin{aligned} \frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}} & = \frac{\partial}{\partial z_1^{(\mathcal L)}}\left[\frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right] \\ & = \frac{\left[\exp(z_1^{(\mathcal L)})\right]' \cdot \sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)}) - \exp(z_1^{(\mathcal L)}) \cdot \left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]'}{\left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]^2} \\ & = \frac{\exp(z_1^{(\mathcal L)}) \cdot \sum_{i=1}^{\mathcal C}\exp(z_i^{(\mathcal L)}) - \left[\exp(z_1^{(\mathcal L)})\right]^2}{\left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]^2} \end{aligned}$
Factor $\exp(z_1^{(\mathcal L)})$ out of the numerator and split the squared denominator into two factors:
$\begin{aligned} \frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}} & = \frac{\exp(z_1^{(\mathcal L)}) \cdot \left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)}) - \exp(z_1^{(\mathcal L)})\right]}{\left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]^2} \\ & = \frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})} \cdot \frac{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)}) - \exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})} \\ & = \frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})} \cdot \left[1 - \frac{\exp(z_1^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})}\right] \\ & = a_1^{(\mathcal L)} \cdot (1 - a_1^{(\mathcal L)}) \end{aligned}$

Likewise, for two indices $p, q$: when $p = q$,
$\frac{\partial a_q^{(\mathcal L)}}{\partial z_p^{(\mathcal L)}} = a_p^{(\mathcal L)} \cdot (1 - a_p^{(\mathcal L)}) \quad p,q \in \{1,2,\cdots,\mathcal C\};p = q$
When $p \neq q$, $\begin{aligned}\left[\frac{\partial \exp(z_q^{(\mathcal L)})}{\partial z_p^{(\mathcal L)}}\right]_{p \neq q} = 0\end{aligned}$ always holds, and the result is:
$\begin{aligned} \frac{\partial a_q^{(\mathcal L)}}{\partial z_p^{(\mathcal L)}} & = \frac{0 \cdot \sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)}) - \exp(z_q^{(\mathcal L)})\cdot \exp(z_p^{(\mathcal L)})}{\left[\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})\right]^2} \\ & = - \frac{\exp(z_q^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})} \cdot \frac{\exp(z_p^{(\mathcal L)})}{\sum_{i=1}^{\mathcal C} \exp(z_i^{(\mathcal L)})} \\ & = -a_p^{(\mathcal L)} \cdot a_q^{(\mathcal L)} \end{aligned}$
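The two cases can be folded into a single expression using the Kronecker delta $\delta_{pq}$, a symbol that the derivation itself does not use and that is introduced here only as shorthand:
$\frac{\partial a_q^{(\mathcal L)}}{\partial z_p^{(\mathcal L)}} = a_q^{(\mathcal L)} \cdot \left(\delta_{pq} - a_p^{(\mathcal L)}\right), \quad \delta_{pq} = \begin{cases} 1 \quad p = q \\ 0 \quad p \neq q \end{cases}$
As a quick check with an assumed two-class output $a^{(\mathcal L)} = (0.8, 0.2)$:
$\frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}} = 0.8 \cdot (1 - 0.8) = 0.16, \quad \frac{\partial a_2^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}} = -0.8 \cdot 0.2 = -0.16$
The two entries sum to zero, as they must, because the components of $a^{(\mathcal L)}$ always add up to $1$.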
At this point every entry of $\begin{aligned}\left[\frac{\partial a^{(\mathcal L)}}{\partial \mathcal Z^{(\mathcal L)}}\right]_{\mathcal C \times \mathcal C \times 1}\end{aligned}$ can be written out. Squeezing this three-dimensional tensor (dropping the last dimension) gives a Jacobian matrix $(\text{Jacobian Matrix})$, each entry of which takes one of the two forms above:
$\frac{\partial a^{(\mathcal L)}}{\partial \mathcal Z^{(\mathcal L)}} = \begin{bmatrix} \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \vdots & \vdots &\ddots & \vdots\\ \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \end{bmatrix}_{\mathcal C \times \mathcal C}$
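The Jacobian can be built in one line as $\text{diag}(a) - a a^T$, which places $a_i(1 - a_i)$ on the diagonal and $-a_i a_j$ elsewhere. The sketch below constructs it and compares it against a finite-difference estimate; the names `softmax` and `softmax_jacobian` and the toy logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(a):
    """J[i, j] = d a_i / d z_j: a_i * (1 - a_i) if i == j, else -a_i * a_j."""
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 2.0, 0.5])   # toy logits
a = softmax(z)
J = softmax_jacobian(a)

# Finite-difference estimate, one column (one z_j) at a time.
eps = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    J_num[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

print(np.max(np.abs(J - J_num)))
```

The matrix is symmetric, and each of its columns sums to zero because the outputs of $\text{Softmax}$ always sum to $1$.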
Now $\begin{aligned}\frac{\partial \mathscr L}{\partial \mathcal Z^{(\mathcal L)}}\end{aligned}$ can be written out. The result is a $1 \times \mathcal C$ vector:
$\begin{aligned} \frac{\partial \mathscr L}{\partial a^{(\mathcal L)}} \cdot \frac{\partial a^{(\mathcal L)}}{\partial\mathcal Z^{(\mathcal L)}} & = \left[-\frac{y_1}{a_1^{(\mathcal L)}},\cdots,-\frac{y_{\mathcal C}}{a_{\mathcal C}^{(\mathcal L)}}\right] \cdot \begin{bmatrix} \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_1^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_2^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \vdots & \vdots &\ddots & \vdots\\ \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}}\end{aligned} & \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_2^{(\mathcal L)}}\end{aligned} & \cdots& \begin{aligned}\frac{\partial a_{\mathcal C}^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\end{aligned} \\ \end{bmatrix}_{\mathcal C \times \mathcal C} \\ & = \left[- \sum_{i=1}^{\mathcal C} \frac{y_i}{a_i^{(\mathcal L)}} \cdot \frac{\partial a_i^{(\mathcal L)}}{\partial z_1^{(\mathcal L)}},\cdots,- \sum_{i=1}^{\mathcal C} \frac{y_i}{a_i^{(\mathcal L)}} \cdot \frac{\partial a_i^{(\mathcal L)}}{\partial z_{\mathcal C}^{(\mathcal L)}}\right]_{1 \times \mathcal C} \\ & = \left[- \sum_{i=1}^{\mathcal C} \frac{y_i}{a_i^{(\mathcal L)}} \cdot \frac{\partial a_i^{(\mathcal L)}}{\partial z_j^{(\mathcal L)}}\right]_{1 \times \mathcal C} \quad j =1,2,\cdots,\mathcal C \end{aligned}$
Substitute the two cases $\begin{aligned}\frac{\partial a_i^{(\mathcal L)}}{\partial z_j^{(\mathcal L)}} = \begin{cases}a_i^{(\mathcal L)}(1 - a_j^{(\mathcal L)}) \quad i = j \\ -a_i^{(\mathcal L)} \cdot a_j^{(\mathcal L)} \quad i \neq j \end{cases}\end{aligned}$ $(i,j \in \{1,2,\cdots,\mathcal C\})$ into the expression above. The factor $a_i^{(\mathcal L)}$ cancels, so each term of the sum, sign included, becomes:
$-\frac{y_i}{a_i^{(\mathcal L)}} \cdot \frac{\partial a_i^{(\mathcal L)}}{\partial z_j^{(\mathcal L)}} = \begin{cases} y_i \cdot a_j^{(\mathcal L)} - y_i & i = j \\ y_i \cdot a_j^{(\mathcal L)} & i \neq j \end{cases}$
Note that the sum $\sum_{i=1}^{\mathcal C}$ then runs over the terms of each kind: for a fixed $j$, exactly one term falls under $i = j$ and the remaining $\mathcal C - 1$ terms fall under $i \neq j$.
In $\begin{aligned} \left[\frac{\partial \mathscr L}{\partial a^{(\mathcal L)}} \cdot \frac{\partial a^{(\mathcal L)}}{\partial\mathcal Z^{(\mathcal L)}}\right]_{1 \times \mathcal C}\end{aligned}$, the sum inside each component contains exactly one term with $i = j$. Each component of the $1 \times \mathcal C$ vector is therefore split into the single $i = j$ term and the $\mathcal C - 1$ terms with $i \neq j$, which are handled separately:
$\begin{aligned} -\sum_{i=1}^{\mathcal C} \frac{y_i}{a_i^{(\mathcal L)}} \cdot \frac{\partial a_i^{(\mathcal L)}}{\partial z_j^{(\mathcal L)}} & = \underbrace{-y_j + y_j \cdot a_j^{(\mathcal L)}}_{i = j} + \underbrace{\sum_{i \neq j} y_i \cdot a_j^{(\mathcal L)}}_{i \neq j} \\ & = -y_j + \left(y_j \cdot a_j^{(\mathcal L)} + \sum_{i \neq j} y_i \cdot a_j^{(\mathcal L)}\right) \\ & = -y_j + a_j^{(\mathcal L)} \cdot \sum_{i=1}^{\mathcal C}y_i \\ & = a_j^{(\mathcal L)} - y_j \end{aligned}$
Here $\begin{aligned}\sum_{i=1}^{\mathcal C}y_i\end{aligned}$ is the sum of the components of the true label vector. The true label is one-hot: its entries lie in $\{0,1\}$ ($1$ for the correct class, $0$ for every other class) and exactly one of them is $1$, so $\begin{aligned}\sum_{i=1}^{\mathcal C}y_i\end{aligned} = 1$.
This is the result for a single component only. Together, all components form a $1 \times \mathcal C$ vector:
$\left[a_j^{(\mathcal L)} - y_j\right]_{1 \times \mathcal C} \quad j = 1,2,\cdots,\mathcal C \Rightarrow a^{(\mathcal L)} - y$
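The whole chain can be checked end to end: multiplying the row vector $-y/\hat y$ by the Jacobian should give the same numbers as the closed form $a^{(\mathcal L)} - y$. The following NumPy sketch does exactly that; the toy logits and label are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0, 0.1])   # toy logits Z^(L)
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot label
a = softmax(z)

dL_da = -y / a                        # dL/da, a 1 x C row vector
J = np.diag(a) - np.outer(a, a)       # da/dZ, the C x C Jacobian
grad_chain = dL_da @ J                # chain rule: row vector times matrix
grad_closed = a - y                   # closed form derived above

print(grad_chain)
print(grad_closed)
```

In practice the closed form is what gets used, since it avoids both the division by $a_i^{(\mathcal L)}$ and the construction of the $\mathcal C \times \mathcal C$ matrix.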
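Since $a^{(\mathcal L)} = \hat y$, for a recurrent neural network at a given time step $t$, component $i \in \{1,2,\cdots,\mathcal C\}$ of $\begin{aligned}\frac{\partial \mathscr L}{\partial \mathcal Z^{(\mathcal L)}}\end{aligned}$ can be written as follows, where the indicator $\mathbb I_{i;y^{(t)}}$ is $1$ when class $i$ is the true label at step $t$ and $0$ otherwise: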
$\hat y_i^{(t)} - \mathbb I_{i;y^{(t)}}$
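In effect, it is just the component-wise difference below, which corresponds to Eq. 10.18 in Section 10.2.2 (p. 234) of the "flower book" (Goodfellow et al., Deep Learning):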
$\begin{pmatrix} \hat y_1^{(t)} \\ \hat y_2^{(t)} \\ \vdots \\ \hat y_{\mathcal C}^{(t)} \\ \end{pmatrix} - \begin{pmatrix} y_1^{(t)} \\ y_2^{(t)} \\ \vdots \\ y_{\mathcal C}^{(t)} \\ \end{pmatrix} \quad \sum_{i=1}^{\mathcal C} y_i^{(t)} = 1;y_i^{(t)} \in \{0,1\}$
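For a batch at one time step, the same gradient can be computed directly from integer class labels by subtracting the indicator from the softmax output. The sketch below assumes the column-per-sample layout $(n_{\mathcal O}, m)$ used in the recap; the names `softmax_columns` and `output_grad` and the toy inputs are hypothetical.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax; z has shape (n_o, m)."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def output_grad(z2, labels):
    """Gradient of each sample's cross-entropy w.r.t. z2: y_hat - indicator."""
    y_hat = softmax_columns(z2)
    grad = y_hat.copy()
    # Subtract 1 at the true-class row of every column (the indicator term).
    grad[labels, np.arange(z2.shape[1])] -= 1.0
    return grad

# Three classes, two samples (toy values); labels hold the true class indices.
z2 = np.array([[2.0, 0.0],
               [1.0, 0.0],
               [0.0, 0.0]])
labels = np.array([0, 2])
grad = output_grad(z2, labels)
print(grad)
```

Each column of `grad` sums to zero, and only the entry at the true class is negative.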