DDPM Formula Derivation (Part 4)

3 Diffusion models and denoising autoencoders

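Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances $\beta_t$ of the forward process as well as the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is organized by the terms of Eq. (5).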

3.1 Forward process and $L_T$

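We ignore the fact that the forward process variances $\beta_t$ can be made learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_T$ is a constant during training and can be ignored.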

3.2 Reverse process and $L_{1: T-1}$

Now we discuss our choices in $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \mathbf{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)$ for $1<t \leq T$. First, we set $\boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)=\sigma_t^2 \mathbf{I}$ to untrained time-dependent constants. Experimentally, both $\sigma_t^2=\beta_t$ and $\sigma_t^2=\tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ give similar results. The first choice is optimal for $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the second is optimal for $\mathbf{x}_0$ deterministically set to a single point. These are the two extreme choices corresponding to upper and lower bounds on the reverse process entropy for data with coordinatewise unit variance [53].
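The two variance choices can be compared numerically. The sketch below assumes a linear schedule from $\beta_1=10^{-4}$ to $\beta_T=0.02$ with $T=1000$; these values are not given in this section and are used purely for illustration, and the function names are hypothetical.

```python
def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced beta_1..beta_T (assumed values, for illustration only)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]


def posterior_variances(betas):
    """beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t for t = 1..T."""
    beta_tildes = []
    alpha_bar_prev = 1.0  # alpha_bar_0 = 1 (empty product)
    for beta in betas:
        alpha_bar = alpha_bar_prev * (1.0 - beta)
        beta_tildes.append((1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * beta)
        alpha_bar_prev = alpha_bar
    return beta_tildes


betas = linear_beta_schedule()          # list index i corresponds to t = i + 1
beta_tildes = posterior_variances(betas)

# beta_tilde_t never exceeds beta_t, and the two nearly coincide for large t.
for t in (1, 2, 10, 100, 1000):
    print(t, betas[t - 1], beta_tildes[t - 1])
```

Under these assumptions the two choices differ mostly at small $t$ (with $\tilde{\beta}_1=0$), which is consistent with the observation that they give similar results in practice.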

Second, to represent the mean $\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)$, we propose a specific parameterization motivated by the following analysis of $L_t$. With $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \sigma_t^2 \mathbf{I}\right)$, we can write:
$L_{t-1}=\mathbb{E}_q\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right]+C \quad(8)$
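where $C$ is a constant that does not depend on $\theta$. So, we see that the most straightforward parameterization of $\boldsymbol{\mu}_\theta$ is a model that predicts $\tilde{\boldsymbol{\mu}}_t$, the forward process posterior mean. However, we can expand Eq. (8) further by reparameterizing Eq. (4) as $\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applying the forward process posterior formula (7):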
$\begin{aligned} L_{t-1}-C & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\right)\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right] \quad(9)\\ & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right]\quad(10) \end{aligned}$
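The step from (9) to (10) can be written out explicitly. As a sketch, recall the posterior mean from Eq. (7) of the paper (not reproduced in this section), substitute $\mathbf{x}_0=\frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\right)$, and use $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\alpha_t \bar{\alpha}_{t-1}$:

$\begin{aligned} \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) & =\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t \\ & =\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{\left(1-\bar{\alpha}_t\right) \sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\right)+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t \\ & =\frac{\beta_t+\alpha_t\left(1-\bar{\alpha}_{t-1}\right)}{\left(1-\bar{\alpha}_t\right) \sqrt{\alpha_t}} \mathbf{x}_t-\frac{\beta_t}{\sqrt{\alpha_t} \sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon} \\ & =\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right) \end{aligned}$

The last line uses $\beta_t+\alpha_t\left(1-\bar{\alpha}_{t-1}\right)=\beta_t+\alpha_t-\bar{\alpha}_t=1-\bar{\alpha}_t$.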

Equation (10) reveals that $\boldsymbol{\mu}_\theta$ must predict $\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)$ given $\mathbf{x}_t$. Since $\mathbf{x}_t$ is available as input to the model, we may choose the parameterization:
$\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)=\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)\quad(11)$
where $\boldsymbol{\epsilon}_\theta$ is a function approximator intended to predict $\boldsymbol{\epsilon}$ from $\mathbf{x}_t$. To sample $\mathbf{x}_{t-1} \sim p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ is to compute $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with $\boldsymbol{\epsilon}_\theta$ as a learned gradient of the data density. Furthermore, with the parameterization (11), Eq. (10) simplifies to:
$\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t\left(1-\bar{\alpha}_t\right)}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t\right)\right\|^2\right]\quad(12)$
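which resembles denoising score matching over multiple noise scales indexed by $t$ [55]. As Eq. (12) is equal to (one term of) the variational bound for the Langevin-like reverse process (11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.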
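The parameterization (11) and the sampling step can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the paper's implementation: the function names are hypothetical, the numeric values are arbitrary, and the true noise $\boldsymbol{\epsilon}$ stands in for a trained $\boldsymbol{\epsilon}_\theta$. With an exact noise prediction, the mean from (11) should coincide with the forward process posterior mean of Eq. (7).

```python
import math
import random


def posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Forward process posterior mean mu_tilde_t(x_t, x_0), Eq. (7) of the paper."""
    beta_t = 1.0 - alpha_t
    c0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    ct = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return [c0 * a + ct * b for a, b in zip(x_0, x_t)]


def reverse_mean(x_t, eps_hat, alpha_t, alpha_bar_t):
    """Eq. (11): mean of p_theta(x_{t-1} | x_t) computed from a noise prediction."""
    beta_t = 1.0 - alpha_t
    scale = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]


def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng):
    """One sampling step: x_{t-1} = mu_theta(x_t, t) + sigma_t * z with z ~ N(0, I)."""
    mean = reverse_mean(x_t, eps_hat, alpha_t, alpha_bar_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


# Toy check with arbitrary (assumed) schedule values at a single step t.
rng = random.Random(0)
alpha_t, alpha_bar_prev = 0.98, 0.5
alpha_bar_t = alpha_t * alpha_bar_prev
x_0 = [0.3, -0.7, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x_0]
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x_0, eps)]

mu_from_eps = reverse_mean(x_t, eps, alpha_t, alpha_bar_t)
mu_posterior = posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev)
x_prev = reverse_step(x_t, eps, alpha_t, alpha_bar_t, math.sqrt(1.0 - alpha_t), rng)
```

In a real sampler `eps_hat` would come from the trained network evaluated at `(x_t, t)`, and the step would be repeated from $t=T$ down to $t=1$.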

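To summarize, we can train the reverse process mean function approximator $\boldsymbol{\mu}_\theta$ to predict $\tilde{\boldsymbol{\mu}}_t$, or, by modifying its parameterization, we can train it to predict $\boldsymbol{\epsilon}$. (There is also the possibility of predicting $\mathbf{x}_0$, but we found this to lead to worse sample quality early in our experiments.) We have shown that the $\boldsymbol{\epsilon}$-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model's variational bound to an objective that resembles denoising score matching. Nonetheless, it is just another parameterization of $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, so in Section 4 we verify its effectiveness by comparing predicting $\boldsymbol{\epsilon}$ against predicting $\tilde{\boldsymbol{\mu}}_t$.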

3.3 Data scaling, reverse process decoder, and $L_0$

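We assume that image data consists of integers in $\{0,1, \ldots, 255\}$ scaled linearly to $[-1,1]$. This ensures that the neural network reverse process operates on consistently scaled inputs, starting from the standard normal prior $p\left(\mathbf{x}_T\right)$. To obtain discrete log likelihoods, we set the last term of the reverse process to an independent discrete decoder derived from the Gaussian $\mathcal{N}\left(\mathbf{x}_0 ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_1, 1\right), \sigma_1^2 \mathbf{I}\right)$: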
$\begin{aligned} p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right) & =\prod_{i=1}^D \int_{\delta_{-}\left(x_0^i\right)}^{\delta_{+}\left(x_0^i\right)} \mathcal{N}\left(x ; \mu_\theta^i\left(\mathbf{x}_1, 1\right), \sigma_1^2\right) d x \\ \delta_{+}(x) & =\begin{cases}\infty & \text { if } x=1 \\ x+\frac{1}{255} & \text { if } x<1\end{cases} \qquad \delta_{-}(x)=\begin{cases}-\infty & \text { if } x=-1 \\ x-\frac{1}{255} & \text { if } x>-1\end{cases} \end{aligned} \quad(13)$
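where $D$ is the data dimensionality and the superscript $i$ indicates extraction of one coordinate. (It would be straightforward to instead use a more powerful decoder such as a conditional autoregressive model, but we leave that to future work.) Similar to the discretized continuous distributions used in VAE decoders and autoregressive models $[34, 52]$, our choice here ensures that the variational bound is a lossless codelength of discrete data, without needing to add noise to the data or incorporate the Jacobian of the scaling operation into the log likelihood. At the end of sampling, we display $\boldsymbol{\mu}_\theta\left(\mathbf{x}_1, 1\right)$ noiselessly.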
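The discrete decoder of Eq. (13) amounts to differences of Gaussian CDFs over bins of half-width $1/255$, with the two edge bins extended to infinity. A minimal per-coordinate sketch follows; the helper names and the small `floor` used to avoid `log(0)` are assumptions made for this illustration.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of N(mean, std^2) at x, with explicit handling of +/- infinity."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def scale_pixel(k):
    """Map an integer pixel value k in {0, ..., 255} linearly to [-1, 1]."""
    return k / 127.5 - 1.0


def bin_probability(k, mean, std):
    """Probability that Eq. (13) assigns to pixel value k for one coordinate."""
    x = scale_pixel(k)
    upper = math.inf if k == 255 else x + 1.0 / 255.0   # delta_plus
    lower = -math.inf if k == 0 else x - 1.0 / 255.0    # delta_minus
    return normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)


def decoder_log_likelihood(pixels, means, std, floor=1e-12):
    """Sum over coordinates of log p(x_0^i | x_1); `floor` is a numerical safeguard."""
    return sum(math.log(max(bin_probability(k, m, std), floor))
               for k, m in zip(pixels, means))


# The 256 bins tile the real line, so the probabilities sum to 1 for any mean and std.
total = sum(bin_probability(k, 0.1, 0.05) for k in range(256))
example_ll = decoder_log_likelihood([0, 128, 255], [-1.0, 0.0, 1.0], 0.05)
```

Because the bins partition the real line, this is a proper distribution over the 256 pixel values, which is what makes the bound interpretable as a codelength for discrete data.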

3.4 Simplified training objective

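With the reverse process and decoder defined above, the variational bound, consisting of terms derived from Eqs. (12) and (13), is clearly differentiable with respect to $\theta$ and is ready to be used for training. However, we found it beneficial to sample quality (and simpler to implement) to train on the following variant of the variational bound: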
$L_{\text {simple }}(\theta):=\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t\right)\right\|^2\right]\quad(14)$
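where $t$ is uniform between 1 and $T$. The $t=1$ case corresponds to $L_0$ with the integral in the discrete decoder definition (13) approximated by the Gaussian probability density function times the bin width, ignoring $\sigma_1^2$ and edge effects. The $t>1$ cases correspond to an unweighted version of Eq. (12), analogous to the loss weighting used by the NCSN denoising score matching model. ($L_T$ does not appear because the forward process variances $\beta_t$ are fixed.) Algorithm 1 displays the complete training procedure with this simplified objective.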
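A single Monte Carlo sample of the objective (14) can be sketched as follows. This is an illustration on plain Python lists, not the paper's training code: `eps_model` is a hypothetical stand-in for the network $\boldsymbol{\epsilon}_\theta$, and the linear schedule values are assumed. Two toy predictors are used: one that always outputs zero, and an "oracle" that recovers the noise exactly from the known $\mathbf{x}_0$.

```python
import math
import random


def l_simple_sample(x_0, eps_model, alpha_bars, rng):
    """One Monte Carlo sample of Eq. (14) for a single data point x_0."""
    T = len(alpha_bars)
    t = rng.randint(1, T)                       # t ~ Uniform({1, ..., T})
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]    # eps ~ N(0, I)
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x_0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


# Assumed linear schedule (values chosen for illustration only).
T = 1000
alpha_bars, running = [], 1.0
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)
    running *= 1.0 - beta
    alpha_bars.append(running)

x_0 = [0.5, -0.25, 1.0, -1.0]


def zero_model(x_t, t):
    """Toy predictor that always outputs zero noise."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t):
    """Toy predictor that inverts the forward reparameterization using the known x_0."""
    a_bar = alpha_bars[t - 1]
    return [(xt - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
            for xt, x in zip(x_t, x_0)]


rng = random.Random(0)
n = 2000
loss_zero = sum(l_simple_sample(x_0, zero_model, alpha_bars, rng) for _ in range(n)) / n
loss_oracle = l_simple_sample(x_0, oracle_model, alpha_bars, rng)
```

The zero predictor's average loss approaches $\mathbb{E}\|\boldsymbol{\epsilon}\|^2=D$ (here $D=4$), while the oracle's loss is zero up to rounding, which is the minimum a trained network is driven toward.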

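Since our simplified objective (14) discards the weighting in Eq. (12), it is a weighted variational bound that emphasizes different aspects of reconstruction compared to the standard variational bound. In particular, the diffusion process setup we use in Section 4 causes the simplified objective to down-weight the loss terms corresponding to small $t$. These terms train the network to denoise data with very small amounts of noise, so it is beneficial to down-weight them so that the network can focus on the more difficult denoising tasks at larger $t$. We will see in our experiments that this reweighting leads to better sample quality.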
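To see which terms the simplified objective down-weights, one can tabulate the coefficient $\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t\left(1-\bar{\alpha}_t\right)}$ from Eq. (12). The sketch below assumes $\sigma_t^2=\beta_t$ and a linear schedule from $10^{-4}$ to $0.02$ with $T=1000$; both are illustrative assumptions, and the function name is hypothetical.

```python
def eq12_weights(betas):
    """Coefficient of Eq. (12) for t = 1..T, assuming sigma_t^2 = beta_t."""
    weights = []
    alpha_bar = 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        sigma2 = beta  # assumed variance choice
        weights.append(beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))
    return weights


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
weights = eq12_weights(betas)

for t in (1, 10, 100, 1000):
    print(t, weights[t - 1])
```

Under these assumptions the coefficient is about 0.5 at $t=1$ and about 0.01 at $t=T$, so replacing every coefficient by 1, as (14) does, lowers the relative importance of the small-$t$ terms.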

We now derive Eq. (8) of the paper.
First, recall the probability density function of the multivariate Gaussian distribution:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{D / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$
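where $\mathbf{x}=\left(x_1, x_2, \ldots, x_D\right)$ is a $D$-dimensional random vector, $\boldsymbol{\mu}$ is the mean vector, and $\boldsymbol{\Sigma}$ is the covariance matrix; note that $|\boldsymbol{\Sigma}|$ denotes the determinant of the covariance matrix.
To simplify the derivation, let $p=\mathcal{N}\left(\boldsymbol{\mu}_p, \Sigma_p\right)$ and $q=\mathcal{N}\left(\boldsymbol{\mu}_q, \Sigma_q\right)$ be two generic $D$-dimensional Gaussians. (In $L_{t-1}$ the first argument of the KL divergence is the forward process posterior $q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)$ and the second is $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, so the generic $p$ and $q$ used here should not be confused with the paper's distributions of the same names.) By the definition of the KL divergence: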
$\begin{aligned} \mathcal{D}_{K L}(p \| q) & =\mathbb{E}_p[\log (p)-\log (q)] \\ & =\mathbb{E}_p\left[\frac{1}{2} \log \frac{\left|\Sigma_q\right|}{\left|\Sigma_p\right|}-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{\boldsymbol{p}}\right)^T \Sigma_p^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{\boldsymbol{p}}\right)+\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{\boldsymbol{q}}\right)^T \Sigma_q^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{\boldsymbol{q}}\right)\right]\\ &=\frac{1}{2}\mathbb{E}_p\left[\log \frac{\left|\Sigma_q\right|}{\left|\Sigma_p\right|}\right]-\frac{1}{2} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T \Sigma_p^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\right]+\frac{1}{2} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_q\right)^T \Sigma_q^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_q\right)\right]\\ &=\frac{1}{2} \log \frac{\left|\Sigma_q\right|}{\left|\Sigma_p\right|}-\frac{1}{2} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T \Sigma_p^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\right]+\frac{1}{2} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_q\right)^T \Sigma_q^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_q\right)\right] \end{aligned}$
Under the Gaussian $p$, the trace trick gives $\mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T \Sigma_p^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\right]=\mathbb{E}_p\left[\operatorname{tr}\left\{\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T \Sigma_p^{-1}\right\}\right]=\operatorname{tr}\left\{\mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T\right] \Sigma_p^{-1}\right\}=\operatorname{tr}\left\{\Sigma_p \Sigma_p^{-1}\right\}=D$. Therefore,
$\begin{aligned} \mathcal{D}_{K L}(p \| q) & =\frac{1}{2} \log \frac{\left|\Sigma_q\right|}{\left|\Sigma_p\right|}-\frac{1}{2} D+\frac{1}{2} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_q\right)^T \Sigma_q^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_q\right)\right] \end{aligned}$
It remains to compute the last expectation. The two cross terms vanish because $\mathbb{E}_p\left[\mathbf{x}-\boldsymbol{\mu}_p\right]=\mathbf{0}$:
$\begin{aligned} & \mathbb{E}_p \left[ (\mathbf{x} - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\mathbf{x} - \boldsymbol{\mu}_q) \right]\\ &= \mathbb{E}_p\left[\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)+\left(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q\right)\right]^T \boldsymbol{\Sigma}_q^{-1}\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)+\left(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q\right)\right]\right]\\ &= \mathbb{E}_p\left[(\mathbf{x} - \boldsymbol{\mu}_p)^T \boldsymbol{\Sigma}_q^{-1} (\mathbf{x} - \boldsymbol{\mu}_p) + (\mathbf{x} - \boldsymbol{\mu}_p)^T \boldsymbol{\Sigma}_q^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q) + (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\mathbf{x} - \boldsymbol{\mu}_p) + (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)\right]\\ &= \mathbb{E}_p\left[\operatorname{tr}\left((\mathbf{x} - \boldsymbol{\mu}_p)^T \boldsymbol{\Sigma}_q^{-1} (\mathbf{x} - \boldsymbol{\mu}_p)\right)\right]+0+0+(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)\\ &=\operatorname{tr}\left(\boldsymbol{\Sigma}_q^{-1} \mathbb{E}_p\left[\left(\mathbf{x}-\boldsymbol{\mu}_p\right)\left(\mathbf{x}-\boldsymbol{\mu}_p\right)^T\right]\right)+0+0+(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)\\ &=\operatorname{tr}\left(\boldsymbol{\Sigma}_q^{-1} \boldsymbol{\Sigma}_p\right)+0+0+(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^T \boldsymbol{\Sigma}_q^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q) \end{aligned}$
Therefore, writing $C$ for the terms that do not involve the means (the covariances are treated as fixed), we obtain:
$\begin{aligned} \mathcal{D}_{K L}(p \| q) & =\frac{1}{2}\left[\log \frac{\left|\boldsymbol{\Sigma}_q\right|}{\left|\boldsymbol{\Sigma}_p\right|}-D+\operatorname{tr}\left(\boldsymbol{\Sigma}_q^{-1} \boldsymbol{\Sigma}_p\right)+\left(\boldsymbol{\mu}_q-\boldsymbol{\mu}_p\right)^T \boldsymbol{\Sigma}_q^{-1}\left(\boldsymbol{\mu}_q-\boldsymbol{\mu}_p\right)\right]\\ &=\frac{1}{2}\left(\boldsymbol{\mu}_{\boldsymbol{q}}-\boldsymbol{\mu}_{\boldsymbol{p}}\right)^T \Sigma_q^{-1}\left(\boldsymbol{\mu}_{\boldsymbol{q}}-\boldsymbol{\mu}_{\boldsymbol{p}}\right)+C \end{aligned}$
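The closed form above is easy to check numerically. The sketch below restricts it to diagonal covariances, an assumption made only to keep the example free of matrix libraries, so that the log-determinant ratio, the trace, and the quadratic form all reduce to sums over coordinates. The function name is hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """D_KL(p || q) for p = N(mu_p, diag(var_p)) and q = N(mu_q, diag(var_q))."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp)      # contribution to log(|Sigma_q| / |Sigma_p|)
        total -= 1.0                    # contribution to -D
        total += vp / vq                # contribution to tr(Sigma_q^{-1} Sigma_p)
        total += (mq - mp) ** 2 / vq    # contribution to the quadratic form
    return 0.5 * total


kl_same = kl_diag_gaussians([0.2, -1.0], [0.5, 2.0], [0.2, -1.0], [0.5, 2.0])
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])   # unit mean shift, equal variances
kl_scale = kl_diag_gaussians([0.0], [1.0], [0.0], [4.0])   # equal means, wider q
```

Note that the quadratic term is weighted by the inverse covariance of the second argument $q$, which matters when the formula is applied to $L_{t-1}$.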
Rewrite $L_{t-1}$. Here the first argument of the KL divergence is the forward process posterior $q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)$, with mean $\tilde{\boldsymbol{\mu}}_t$ and covariance $\tilde{\beta}_t \mathbf{I}$, and the second argument is $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, with mean $\boldsymbol{\mu}_\theta$ and covariance $\sigma_t^2 \mathbf{I}$. In the closed form for the KL divergence between two Gaussians, the quadratic term is weighted by the inverse covariance of the second argument, so the weight is $\left(\sigma_t^2 \mathbf{I}\right)^{-1}$, and neither covariance depends on $\theta$:
$\begin{aligned} L_{t-1} & =\mathbb{E}_q\left[\mathcal{D}_{K L}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)\right] \\ & =\mathbb{E}_q\left[\frac{1}{2}\left(\boldsymbol{\mu}_\theta-\tilde{\boldsymbol{\mu}}_t\right)^T\left(\sigma_t^2 \mathbf{I}\right)^{-1}\left(\boldsymbol{\mu}_\theta-\tilde{\boldsymbol{\mu}}_t\right)\right]+C \\ & =\mathbb{E}_q\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right]+C \end{aligned}$
This is exactly Eq. (8).
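As a numerical sanity check of Eq. (8), the sketch below treats the forward posterior as $\mathcal{N}\left(\tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t \mathbf{I}\right)$ and the reverse step as $\mathcal{N}\left(\boldsymbol{\mu}_\theta, \sigma_t^2 \mathbf{I}\right)$, and confirms that the KL divergence minus $\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t-\boldsymbol{\mu}_\theta\right\|^2$ does not change with $\boldsymbol{\mu}_\theta$. The function names and numeric values are arbitrary choices for illustration; for isotropic covariances the leftover constant is $\frac{D}{2}\left(\log \frac{\sigma_t^2}{\tilde{\beta}_t}-1+\frac{\tilde{\beta}_t}{\sigma_t^2}\right)$.

```python
import math


def kl_isotropic(mu_a, var_a, mu_b, var_b):
    """D_KL(N(mu_a, var_a I) || N(mu_b, var_b I)) for scalar variances."""
    D = len(mu_a)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    return 0.5 * (D * math.log(var_b / var_a) - D + D * var_a / var_b + sq_dist / var_b)


def leftover_constant(mu_tilde, beta_tilde, mu_theta, sigma2):
    """KL(q || p_theta) minus the quadratic term of Eq. (8); should not depend on mu_theta."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_tilde, mu_theta))
    return kl_isotropic(mu_tilde, beta_tilde, mu_theta, sigma2) - sq_dist / (2.0 * sigma2)


mu_tilde = [0.1, -0.4, 0.9]
beta_tilde, sigma2 = 0.008, 0.01   # arbitrary values with beta_tilde <= sigma2

c_first = leftover_constant(mu_tilde, beta_tilde, [0.0, 0.0, 0.0], sigma2)
c_second = leftover_constant(mu_tilde, beta_tilde, [0.5, -1.0, 2.0], sigma2)
c_closed_form = 0.5 * len(mu_tilde) * (
    math.log(sigma2 / beta_tilde) - 1.0 + beta_tilde / sigma2
)
```

Both evaluations agree with the closed-form constant, so minimizing $L_{t-1}$ over $\theta$ is equivalent to minimizing the weighted squared distance between the two means.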