- https://www.youtube.com/watch?v=uXY18nzdSsM
- Problems with existing generative models: component-by-component (autoregressive) models impose a fixed generation order and are slow to generate; the variational auto-encoder optimizes a lower bound of the likelihood, which is only an approximation; the generative adversarial network suffers from unstable training.
- generator:
– A network $G$ defines a probability distribution $p_G$: sample $z$ from a normal distribution and pass it through $G$ to get $x = G(z)$; $x$ then follows the distribution $p_G(x)$. We want $p_G(x)$ to be as close as possible to the real data distribution $p_{data}(x)$, from which the training samples $\{x^1, x^2, \cdots, x^m\}$ are drawn;
– To make the two distributions as close as possible, the standard approach is to maximize the likelihood, $G^* = \arg\max_G \sum_{i=1}^m \log p_G(x^i)$, i.e. make the probability that $G$ produces each training sample $x^i$ as large as possible;
– the flow-based model directly optimizes the objective above;
- math background
– Jacobian: for $x = f(z)$ with $z = \left[\begin{matrix} z_1 \\ z_2 \end{matrix}\right]$ and $x = \left[\begin{matrix} x_1 \\ x_2 \end{matrix}\right]$, the Jacobian is defined as $J_f = \left[\begin{matrix} \partial x_1/\partial z_1 & \partial x_1/\partial z_2 \\ \partial x_2/\partial z_1 & \partial x_2/\partial z_2 \end{matrix}\right]$; for the inverse map $z = f^{-1}(x)$, $J_{f^{-1}} = \left[\begin{matrix} \partial z_1/\partial x_1 & \partial z_1/\partial x_2 \\ \partial z_2/\partial x_1 & \partial z_2/\partial x_2 \end{matrix}\right]$; the two Jacobians satisfy $J_f J_{f^{-1}} = I$, i.e. they are inverses of each other (see the numerical check after this list);
– determinant: for a square matrix $A$, $\det(A) = 1/\det(A^{-1})$; the determinant can be interpreted as a (signed) volume in high-dimensional space;
– change of variables: suppose $z$ follows a normal distribution $\pi(z)$ and $x = f(z)$ follows a distribution $p(x)$. In one dimension, the probability mass over corresponding small intervals must match: $p(x')\Delta x = \pi(z')\Delta z \rightarrow p(x') = \pi(z')\frac{\Delta z}{\Delta x} \rightarrow p(x') = \pi(z')\left|\frac{dz}{dx}\right|$. Extending to two dimensions, the probability mass over corresponding regions is equal: $p(x')\left|\det\left[\begin{matrix} \Delta x_{11} & \Delta x_{21} \\ \Delta x_{12} & \Delta x_{22} \end{matrix}\right]\right| = \pi(z')\Delta z_1 \Delta z_2$, where $\Delta x_{11}, \Delta x_{21}$ are the changes of $x_1, x_2$ when $z_1$ changes, and $\Delta x_{12}, \Delta x_{22}$ are the changes of $x_1, x_2$ when $z_2$ changes. Rearranging (and using $\det(A) = \det(A^T)$ in the last step): $\pi(z') = p(x')\left|\frac{1}{\Delta z_1 \Delta z_2}\det\left[\begin{matrix} \Delta x_{11} & \Delta x_{21} \\ \Delta x_{12} & \Delta x_{22} \end{matrix}\right]\right| = p(x')\left|\det\left[\begin{matrix} \Delta x_{11}/\Delta z_1 & \Delta x_{21}/\Delta z_1 \\ \Delta x_{12}/\Delta z_2 & \Delta x_{22}/\Delta z_2 \end{matrix}\right]\right| = p(x')\left|\det\left[\begin{matrix} \partial x_1/\partial z_1 & \partial x_2/\partial z_1 \\ \partial x_1/\partial z_2 & \partial x_2/\partial z_2 \end{matrix}\right]\right| = p(x')\left|\det\left[\begin{matrix} \partial x_1/\partial z_1 & \partial x_1/\partial z_2 \\ \partial x_2/\partial z_1 & \partial x_2/\partial z_2 \end{matrix}\right]\right| = p(x')\left|\det(J_f)\right|$; equivalently, $p(x') = \pi(z')\left|\det(J_{f^{-1}})\right|$;
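A quick numerical sanity check of the three facts above; a minimal sketch assuming a hypothetical invertible linear map $f(z) = Az$ (so $J_f = A$ everywhere and $p(x)$ is Gaussian with covariance $AA^T$). The matrix `A` and the evaluation point are my own choices for illustration, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A hypothetical invertible linear map f(z) = A z, so J_f = A everywhere.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
A_inv = np.linalg.inv(A)

# Jacobians of f and f^{-1} are inverses of each other: J_f J_{f^{-1}} = I.
print(np.allclose(A @ A_inv, np.eye(2)))                          # True

# det(A) = 1 / det(A^{-1}).
print(np.isclose(np.linalg.det(A), 1.0 / np.linalg.det(A_inv)))   # True

# Change of variables: z ~ N(0, I), x = A z  =>  x ~ N(0, A A^T).
pi = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))   # pi(z)
p = multivariate_normal(mean=np.zeros(2), cov=A @ A.T)      # p(x)

z_prime = np.array([0.3, -1.2])
x_prime = A @ z_prime

# p(x') = pi(z') * |det(J_{f^{-1}})|, with J_{f^{-1}} = A^{-1}.
lhs = p.pdf(x_prime)
rhs = pi.pdf(z_prime) * np.abs(np.linalg.det(A_inv))
print(np.isclose(lhs, rhs))                                        # True
```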
- flow model: the original objective is to maximize the likelihood $G^* = \arg\max_G \sum_{i=1}^m \log p_G(x^i)$, and by the change-of-variables formula $p_G(x^i) = \pi(z^i)\left|\det(J_{G^{-1}})\right|$ with $z^i = G^{-1}(x^i)$, so $\log p_G(x^i) = \log \pi(G^{-1}(x^i)) + \log\left|\det(J_{G^{-1}})\right|$; this requires computing $\det(J_G)$ and $G^{-1}$, and to guarantee that $G$ is invertible, the input and output must have the same dimension; because each such constrained $G$ is limited, stack several generators $G_1, G_2, \cdots, G_K$:
$p_1(x^i) = \pi(z^i)\left(\left|\det(J_{G_1^{-1}})\right|\right)$
$p_2(x^i) = \pi(z^i)\left(\left|\det(J_{G_1^{-1}})\right|\right)\left(\left|\det(J_{G_2^{-1}})\right|\right)$
$\cdots$
$p_K(x^i) = \pi(z^i)\left(\left|\det(J_{G_1^{-1}})\right|\right)\cdots\left(\left|\det(J_{G_K^{-1}})\right|\right)$
$\log p_K(x^i) = \log\pi(z^i) + \sum_{k=1}^K \log\left|\det(J_{G_k^{-1}})\right|$, where $z^i = G_1^{-1}(\cdots G_K^{-1}(x^i))$
Note that this objective involves only $G^{-1}$: training works entirely with the inverse mappings, while $G$ itself is used at generation time; a sketch of evaluating this log-likelihood follows below.
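A minimal sketch of evaluating $\log p_K(x)$ for a stack of invertible layers. The `AffineLayer` class and its element-wise affine transform are my own toy illustration of a generic invertible $G_k$, not the architecture used in the lecture.

```python
import numpy as np

class AffineLayer:
    """Toy invertible layer G_k(z) = a * z + b (element-wise), used only to
    illustrate the stacked-flow log-likelihood; real flows use coupling layers."""
    def __init__(self, a, b):
        self.a, self.b = np.asarray(a, float), np.asarray(b, float)

    def forward(self, z):             # x = G_k(z)
        return self.a * z + self.b

    def inverse(self, x):             # z = G_k^{-1}(x)
        return (x - self.b) / self.a

    def log_abs_det_jacobian_inv(self, x):
        # J_{G_k^{-1}} is diagonal with entries 1/a, so
        # log|det(J_{G_k^{-1}})| = -sum(log|a|), independent of x here.
        return -np.sum(np.log(np.abs(self.a)))

def log_prob(x, layers):
    """log p_K(x) = log pi(z) + sum_k log|det(J_{G_k^{-1}})|,
    with z = G_1^{-1}( ... G_K^{-1}(x) ) and pi a standard normal."""
    z, log_det = np.asarray(x, float), 0.0
    for layer in reversed(layers):    # apply G_K^{-1} first, then G_{K-1}^{-1}, ...
        log_det += layer.log_abs_det_jacobian_inv(z)
        z = layer.inverse(z)
    log_pi = -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)
    return log_pi + log_det

# Usage: maximizing the likelihood means adjusting the layer parameters so that
# log_prob of the training samples x^i becomes as large as possible.
layers = [AffineLayer([2.0, 0.5], [1.0, -1.0]), AffineLayer([1.5, 3.0], [0.0, 0.2])]
x = np.array([0.7, -0.3])
print(log_prob(x, layers))
```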
- coupling layer: the invertible building block used in NICE, RealNVP and Glow
– forward: split the input $z$ into two parts $(z_{1:d}, z_{d+1:D})$; copy the first part directly, $x_{1:d} = z_{1:d}$, and transform the second part element-wise, $x_{d+1:D} = \beta \odot z_{d+1:D} + \gamma$, where $\beta = F(z_{1:d})$ and $\gamma = H(z_{1:d})$ come from arbitrary networks $F$ and $H$;
– inverse: since $x_{1:d} = z_{1:d}$, recover $z_{1:d} = x_{1:d}$ first, recompute $\beta = F(x_{1:d})$ and $\gamma = H(x_{1:d})$, then $z_{d+1:D} = (x_{d+1:D} - \gamma) / \beta$ element-wise; $F$ and $H$ themselves never need to be inverted;
– next, compute the Jacobian: $J_G = \partial x / \partial z$ is block triangular (the top-left block $\partial x_{1:d} / \partial z_{1:d}$ is the identity, the top-right block is zero, and the bottom-right block $\partial x_{d+1:D} / \partial z_{d+1:D}$ is diagonal with entries $\beta_i$), so $|\det(J_G)| = |\beta_{d+1}\beta_{d+2}\cdots\beta_D|$;
– the above is a single coupling layer; next, stack several of them, swapping which half is copied between layers so that every dimension eventually gets transformed (a minimal sketch of such a stack follows below):
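A minimal NumPy sketch of an affine coupling layer and a small stack of them. The `tanh`-based stand-ins for $F$ and $H$, the dimensions, and the half-swapping scheme are illustrative assumptions, not the lecture's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3  # total dimension and split point

class CouplingLayer:
    """Affine coupling layer: copy z[:d], transform z[d:] element-wise."""
    def __init__(self):
        # Stand-ins for the networks F (scale) and H (shift); any functions work
        # here because they are only ever evaluated, never inverted.
        self.Wf = rng.normal(size=(d, D - d))
        self.Wh = rng.normal(size=(d, D - d))

    def _beta_gamma(self, first_half):
        beta = np.exp(np.tanh(first_half @ self.Wf))   # keep scales nonzero/positive
        gamma = np.tanh(first_half @ self.Wh)
        return beta, gamma

    def forward(self, z):
        beta, gamma = self._beta_gamma(z[:d])
        return np.concatenate([z[:d], beta * z[d:] + gamma])

    def inverse(self, x):
        beta, gamma = self._beta_gamma(x[:d])
        return np.concatenate([x[:d], (x[d:] - gamma) / beta])

    def log_abs_det_jacobian(self, z):
        beta, _ = self._beta_gamma(z[:d])
        return np.sum(np.log(np.abs(beta)))            # |det(J_G)| = prod(beta_i)

def swap(v):
    # Reverse the roles of the two halves between layers so every
    # dimension is transformed somewhere in the stack.
    return np.concatenate([v[d:], v[:d]])

layers = [CouplingLayer() for _ in range(3)]

z = rng.normal(size=D)
x = z
for layer in layers:
    x = swap(layer.forward(x))

# Invert the whole stack: undo the swap, then the coupling layer, in reverse order.
z_rec = x
for layer in reversed(layers):
    z_rec = layer.inverse(swap(z_rec))
print(np.allclose(z, z_rec))   # True

# Total log|det(J)| of the stack: sum over layers (the swap permutation has |det| = 1).
v, total_log_det = z, 0.0
for layer in layers:
    total_log_det += layer.log_abs_det_jacobian(v)
    v = swap(layer.forward(v))
print(total_log_det)
```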