Week 6 hw3-1: Backpropagation Derivation for a Fully Connected Network
This took me quite a while to work out, so I'm writing it down here.
The network in the assignment consists of several fully connected layers with ReLU activations; the output layer applies softmax, and the loss function is cross-entropy.
1. Notation
Suppose the network has $n$ layers. As shown in the figure, for $i<n$ the following relations hold:
$$
\begin{cases}
z^{i-1}_j=\text{ReLU}(x^{i-1}_j)\\
x^{i}_j=\text{ReLU}(\hat{z}^{i-1}_j)\\
\hat{z}_{j}^{i-1}=\sum\limits_{k=1}^{d_{i-1}}z^{i-1}_k w_{kj}^{i-1}+b_j^{i-1}
\end{cases}
$$
We also write:
$$
\begin{cases}
\mathbf{x}^{i-1}=[x_1^{i-1},x_2^{i-1},\dots,x_{d_{i-1}}^{i-1}]\\
\mathbf{x}^{i}=[x_1^{i},x_2^{i},\dots,x_{d_{i}}^{i}]\\
\mathbf{z}^{i-1}=[z_1^{i-1},z_2^{i-1},\dots,z_{d_{i-1}}^{i-1}]\\
\hat{\mathbf{z}}^{i-1}=[\hat{z}_1^{i-1},\hat{z}_2^{i-1},\dots,\hat{z}_{d_{i}}^{i-1}]\\
\mathbf{b}^{i-1}=[b_1^{i-1},b_2^{i-1},\dots,b_{d_i}^{i-1}]\\
\mathbf{W}^{i-1}=
\begin{pmatrix}
w_{11}^{i-1} & w_{12}^{i-1} & \dots & w_{1d_i}^{i-1}\\
w_{21}^{i-1} & w_{22}^{i-1} & \dots & w_{2d_i}^{i-1}\\
\vdots & \vdots & \ddots & \vdots\\
w_{d_{i-1}1}^{i-1} & w_{d_{i-1}2}^{i-1} & \dots & w_{d_{i-1}d_i}^{i-1}
\end{pmatrix}
\end{cases}
$$
This gives the matrix form:
$$
\begin{cases}
\mathbf{z}^{i-1}=\text{ReLU}(\mathbf{x}^{i-1})\\
\mathbf{x}^i=\text{ReLU}(\hat{\mathbf{z}}^{i-1})\\
\hat{\mathbf{z}}^{i-1}=\mathbf{z}^{i-1}\mathbf{W}^{i-1}+\mathbf{b}^{i-1}
\end{cases}
$$
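To sanity-check the shapes in this matrix form, here is a minimal NumPy sketch of one hidden-layer step; the names `x_prev`, `W`, `b` and the sizes $d_{i-1}=4$, $d_i=3$ are my own choices for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# arbitrary layer sizes d_{i-1} = 4, d_i = 3, just for the shape check
rng = np.random.default_rng(0)
x_prev = rng.normal(size=(1, 4))   # row vector x^{i-1}
W = rng.normal(size=(4, 3))        # W^{i-1}, shape d_{i-1} x d_i
b = rng.normal(size=(1, 3))        # b^{i-1}

z_prev = relu(x_prev)              # z^{i-1} = ReLU(x^{i-1})
z_hat = z_prev @ W + b             # \hat{z}^{i-1} = z^{i-1} W^{i-1} + b^{i-1}
x_next = relu(z_hat)               # x^{i} = ReLU(\hat{z}^{i-1})
print(x_next.shape)                # (1, 3) = (1, d_i)
```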
For $i=n$, write $y_i$ for the ground-truth label at position $i$ and $\hat{y}_i$ for the corresponding prediction. We have:
$$
\begin{cases}
z^{n-1}_j=\text{ReLU}(x^{n-1}_j)\\
\hat{y}_j=[\text{softmax}(\hat{z}^{n-1}_1,\hat{z}^{n-1}_2,\dots,\hat{z}^{n-1}_{d_n})]_j\\
\hat{z}_{j}^{n-1}=\sum\limits_{k=1}^{d_{n-1}}z^{n-1}_k w_{kj}^{n-1}+b_j^{n-1}
\end{cases}
$$
Write $\mathbf{y}=[y_1,y_2,\dots,y_{d_n}]$ and $\hat{\mathbf{y}}=[\hat y_1,\hat y_2,\dots,\hat y_{d_n}]$, where $\mathbf{y}$ is a one-hot vector.
Again in matrix form:
$$
\begin{cases}
\mathbf{z}^{n-1}=\text{ReLU}(\mathbf{x}^{n-1})\\
\hat{\mathbf{y}}=\text{softmax}(\hat{\mathbf{z}}^{n-1})\\
\hat{\mathbf{z}}^{n-1}=\mathbf{z}^{n-1}\mathbf{W}^{n-1}+\mathbf{b}^{n-1}
\end{cases}
$$
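Similarly, a small sketch of the output layer under these definitions; the max-subtraction inside `softmax` is a standard numerical-stability trick and not part of the derivation, and all names and sizes are mine:

```python
import numpy as np

def softmax(v):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

rng = np.random.default_rng(1)
z_prev = np.maximum(rng.normal(size=(1, 4)), 0.0)  # z^{n-1} = ReLU(x^{n-1})
W_last = rng.normal(size=(4, 3))                   # W^{n-1}
b_last = rng.normal(size=(1, 3))                   # b^{n-1}

z_hat_last = z_prev @ W_last + b_last              # \hat{z}^{n-1}
y_hat = softmax(z_hat_last)                        # \hat{y}
print(y_hat, y_hat.sum())                          # class probabilities summing to 1
```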
The loss function is $J(\mathbf{y},\hat{\mathbf{y}})=-\sum\limits_{i=1}^{d_n}y_i\log \hat{y}_i$.
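For completeness, the loss definition as a one-liner; the `eps` term is my own guard against $\log 0$ and is not part of the formula:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # J(y, y_hat) = -sum_i y_i * log(y_hat_i); for one-hot y this is -log(y_hat at the true class)
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([[0.0, 1.0, 0.0]])      # one-hot ground truth
y_hat = np.array([[0.2, 0.7, 0.1]])  # some softmax output
print(cross_entropy(y, y_hat))       # == -log(0.7) up to eps
```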
2. Derivation
Next we compute $\nabla J_{\mathbf{W}^l}$ and $\nabla J_{\mathbf{b}^l}$.
From the computational graph:
$$
\nabla J_{w^l_{ij}}=\frac{\partial J}{\partial \hat{z}_j^l}\frac{\partial \hat{z}_j^l}{\partial w_{ij}^l}=\frac{\partial J}{\partial \hat{z}_j^l}\,z^l_{i}
$$

$$
\nabla J_{b^l_{j}}=\frac{\partial J}{\partial \hat{z}_j^l}\frac{\partial \hat{z}_j^l}{\partial b^l_{j}}=\frac{\partial J}{\partial \hat{z}_j^l}
$$
In matrix form:
$$
\nabla J_{\mathbf{W}^l}=(\mathbf{z}^l)^{T}\,\nabla J_{\hat{\mathbf{z}}^l}
$$

$$
\nabla J_{\mathbf{b}^l}=\nabla J_{\hat{\mathbf{z}}^l}
$$
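In code these two identities reduce to an outer product and a copy; `dz_hat` below is assumed to hold $\nabla J_{\hat{\mathbf{z}}^l}$ as a row vector, and the names and sizes are mine:

```python
import numpy as np

rng = np.random.default_rng(2)
z_l = np.maximum(rng.normal(size=(1, 4)), 0.0)  # z^l, row vector of size d_l
dz_hat = rng.normal(size=(1, 3))                # \nabla J_{\hat{z}^l}, size d_{l+1}

dW = z_l.T @ dz_hat        # \nabla J_{W^l} = (z^l)^T \nabla J_{\hat{z}^l}, shape d_l x d_{l+1}
db = dz_hat.copy()         # \nabla J_{b^l} = \nabla J_{\hat{z}^l}
print(dW.shape, db.shape)  # (4, 3) (1, 3)
```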
So it suffices to compute $\nabla J_{\hat{\mathbf{z}}^l}$, which we do by setting up a recurrence. From the computational graph:
$$
\nabla J_{\hat{z}_{j}^l}
=\left(\sum_{k=1}^{d_{l+2}}\frac{\partial J}{\partial \hat{z}_{k}^{l+1}}\frac{\partial \hat{z}_{k}^{l+1}}{\partial x_j^{l+1}}\right)\frac{\partial x_j^{l+1}}{\partial \hat{z}_{j}^l}
=\left(\sum_{k=1}^{d_{l+2}}\frac{\partial J}{\partial \hat{z}_{k}^{l+1}}w_{jk}^{l+1}\right)\left.\frac{\mathrm{d}\,\text{ReLU}}{\mathrm{d}x}\right|_{x=\hat{z}_{j}^{l}}
$$

(The sum runs over the $d_{l+2}$ components of $\hat{\mathbf{z}}^{l+1}$, since $\hat{\mathbf{z}}^{i-1}$ has $d_i$ entries.)
In matrix form:
$$
\nabla J_{\hat{\mathbf{z}}^l}=\left(\nabla J_{\hat{\mathbf{z}}^{l+1}}\,(\mathbf{W}^{l+1})^{T}\right)\odot\text{ReLU}'(\hat{\mathbf{z}}^l)
$$

where $\odot$ denotes the element-wise (Hadamard) product and $\text{ReLU}'$ is applied element-wise.
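One step of this recurrence in NumPy, using the subgradient convention $\text{ReLU}'(x)=1$ for $x>0$ and $0$ otherwise; the helper name `backprop_step` and the cached `z_hat_l` are assumptions of this sketch:

```python
import numpy as np

def backprop_step(dz_hat_next, W_next, z_hat_l):
    """Given the row vector grad w.r.t. \hat{z}^{l+1}, W^{l+1}, and the cached \hat{z}^l,
    return grad w.r.t. \hat{z}^l = (dz_hat_next @ W_next.T) masked by ReLU'(\hat{z}^l)."""
    return (dz_hat_next @ W_next.T) * (z_hat_l > 0)

rng = np.random.default_rng(3)
dz_hat_next = rng.normal(size=(1, 3))  # grad at layer l+1, size d_{l+2}
W_next = rng.normal(size=(4, 3))       # W^{l+1}, shape d_{l+1} x d_{l+2}
z_hat_l = rng.normal(size=(1, 4))      # cached pre-activation \hat{z}^l, size d_{l+1}

print(backprop_step(dz_hat_next, W_next, z_hat_l).shape)  # (1, 4) = (1, d_{l+1})
```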
Therefore, once $\nabla J_{\hat{\mathbf{z}}^{n-1}}$ is known, the computation is complete. We now compute $\nabla J_{\hat{\mathbf{z}}^{n-1}}$.
Since $\mathbf{y}$ is a one-hot vector, if the ground-truth class is $i$, then $J(\mathbf{y},\hat{\mathbf{y}})=-y_i\log\hat{y}_i=-\log\hat{y}_i$.
From the computational graph:
$$
\nabla J_{\hat{z}_j^{n-1}}=\frac{\partial J}{\partial \hat{z}_j^{n-1}}
=\sum_{k=1}^{d_n}\frac{\partial J}{\partial\hat{y}_k}\frac{\partial\hat{y}_k}{\partial \hat{z}_j^{n-1}}
=\frac{\partial J}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}
=-\frac{1}{\hat{y}_i}\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}
$$
Since $\hat{y}_i=\dfrac{\exp(\hat{z}_i^{n-1})}{\sum_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})}$, we consider two cases.
When $i\neq j$:
$$
\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}
=-\frac{\exp(\hat{z}_i^{n-1})}{\left(\sum_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})\right)^2}\exp(\hat{z}_j^{n-1})
=-\hat{y}_i\hat{y}_j
$$
When $i=j$:
$$
\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}
=\frac{\sum\limits_{k=1,k\neq i}^{d_n}\exp(\hat{z}_k^{n-1})}{\left(\sum\limits_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})\right)^2}\exp(\hat{z}_i^{n-1})
=(1-\hat{y}_i)\hat{y}_i
$$
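Taken together, the two cases say the softmax Jacobian is $\operatorname{diag}(\hat{\mathbf{y}})$ minus the outer product of $\hat{\mathbf{y}}$ with itself. A quick central-difference check of that claim (my own verification sketch, not part of the assignment):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

rng = np.random.default_rng(4)
z = rng.normal(size=5)
y_hat = softmax(z)
eps = 1e-6

# numerical Jacobian: J_num[i, j] = d y_hat_i / d z_j
J_num = np.zeros((5, 5))
for j in range(5):
    dz = np.zeros(5)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

# analytic Jacobian from the two cases above: diag(y_hat) - outer(y_hat, y_hat)
J_ana = np.diag(y_hat) - np.outer(y_hat, y_hat)
print(np.max(np.abs(J_num - J_ana)))  # tiny (~1e-10): the cases match the numerics
```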
Substituting back into $\nabla J_{\hat{z}_j^{n-1}}$, we obtain:
$$
\nabla J_{\hat{z}_j^{n-1}}=
\begin{cases}
\hat{y}_j, & i\neq j\\
-1+\hat{y}_j, & i = j
\end{cases}
$$
Since $\mathbf{y}$ is one-hot with $y_i=1$, this is simply $\nabla J_{\hat{\mathbf{z}}^{n-1}}=\hat{\mathbf{y}}-\mathbf{y}$ in vector form. This completes the derivation.
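Combining this with the recurrence above, the backward pass is seeded with $\hat{\mathbf{y}}-\mathbf{y}$. A quick numerical check of that seed (the `softmax` and `loss` helpers are my own):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def loss(z_hat, y):
    # cross-entropy of softmax(z_hat) against one-hot y
    return -np.sum(y * np.log(softmax(z_hat)))

rng = np.random.default_rng(5)
z_hat = rng.normal(size=4)  # \hat{z}^{n-1}
y = np.zeros(4)
y[2] = 1.0                  # one-hot label, true class i = 2

# analytic gradient from the derivation: \nabla J_{\hat{z}^{n-1}} = \hat{y} - y
g_ana = softmax(z_hat) - y

# central-difference gradient for comparison
eps = 1e-6
g_num = np.array([(loss(z_hat + eps * e, y) - loss(z_hat - eps * e, y)) / (2 * eps)
                  for e in np.eye(4)])
print(np.max(np.abs(g_ana - g_num)))  # tiny (~1e-10): analytic and numeric gradients agree
```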