Classifier Guided Diffusion Model

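Background

For a standard diffusion model (DM) such as DDPM or DDIM, sampling starts from a noise distribution and produces an image through repeated denoising steps. The class of the generated image is random, though. How can we generate an image of a specific class? That is the problem classifier guidance addresses.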

Method overview

To derive a DM conditioned on a class label $y$, we start from the following definitions:
$\begin{aligned} \hat{q}(x_0) &:= q(x_0) \\ \hat{q}(y|x_0) &:= \text{Known labels per sample} \\ \hat{q}(x_{t+1}|x_{t}, y) &:= q(x_{t+1}|x_t) \\ \hat{q}(x_{1:T}|x_0, y)&:= \prod \limits_{t=1}^T\hat{q}(x_t|x_{t-1}, y) \\ \end{aligned} \tag{1}$
Although these definitions give a noising process $\hat{q}$ conditioned on $y$, we can show that $\hat{q}$ behaves exactly like $q$ once $y$ is marginalized out:
$\begin{aligned} \hat{q}(x_{t+1}|x_t) &= \int_y \hat{q}(x_{t+1}, y| x_t)dy \\ &= \int_y \hat{q}(x_{t+1}|x_t, y)\hat{q}(y|x_t)dy \\ &= \int_y q(x_{t+1}|x_t)\hat{q}(y|x_t)dy \\ &= q(x_{t+1}|x_t) \int_y \hat{q}(y|x_t)dy \\ &= q(x_{t+1}|x_t) \\ &= \hat{q}(x_{t+1}|x_t, y) \\ \end{aligned}\tag{2}$
By the same reasoning:
$\begin{aligned} \hat{q}(x_{1:T}|x_0) &= \int_y \hat{q}(x_{1:T}, y|x_0) dy \\ &= \int_y \hat{q}(x_{1:T}|y, x_0)\hat{q}(y| x_0) dy \\ &= \int_y \prod \limits_{t=1}^T \underbrace{ \hat{q}(x_t|x_{t-1}, y)}_{q(x_t|x_{t-1})} \hat{q}(y| x_0) dy \\ &= \underbrace{\prod \limits_{t=1}^Tq(x_t|x_{t-1})}_{q(x_{1:T}|x_0)} \underbrace{\int_y \hat{q}(y| x_0)dy}_{=1} \\ &= q(x_{1:T}|x_0) \end{aligned}\tag{3}$
From this we can also derive
$\begin{aligned} \hat{q}(x_t) &= \int_{x_{0:t - 1}} \hat{q}(x_0, \cdots, x_t)dx_{0:t-1} \\ &= \int_{x_{0:t - 1}} \underbrace{\hat{q}(x_0)}_{q(x_0)} \underbrace{\hat{q}(x_1, \cdots, x_t|x_0)}_{q(x_{1:t}|x_0)}dx_{0:t-1} \\ &= q(x_t) \end{aligned} \tag{4}$
These derivations show that the forward process of the conditional DM is identical to that of DDPM. By Bayes' rule, the reverse process without the condition $y$ is therefore unchanged as well:
$\hat{p}(x_t|x_{t+1}) = p(x_t|x_{t+1}) \tag{5}$
We can also show that the class distribution $\hat{q}(y|x_t)$ depends only on the current input $x_t$ and not on $x_{t+1}$:
$\begin{aligned} \hat{q}(y|x_t, x_{t+1}) & = \frac{ \overbrace{ \hat{q}(x_{t+1}|x_t, y)}^{\hat{q}(x_{t+1}|x_t)} \hat{q}(y|x_t) } {\hat{q}(x_{t+1}|x_t )} \\ & = \hat{q}(y|x_t) \end{aligned} \tag{6}$

Conditional denoising process

Define the class-conditional denoising process as $\hat{p}(x_t|x_{t+1}, y)$:

$\begin{aligned} \hat{p} (x_t| x_{t+1}, y) & = \frac{\hat{p} (x_t, x_{t+1}, y) }{\hat{p} (y|x_{t+1}) \hat{p} (x_{t+1}) } \\ & = \frac{\hat{p} (x_t, y | x_{t+1}) }{\hat{p} (y|x_{t+1}) } \\ & = \frac{\overbrace{\hat{p} (y|x_t, x_{t+1})}^{\hat{p}(y|x_t)} \overbrace{\hat{p}(x_t | x_{t+1})}^{p(x_t|x_{t+1})} }{\hat{p} (y|x_{t+1}) } \\ & = \frac{\hat{p} (y|x_t) p(x_t | x_{t+1}) }{\hat{p} (y|x_{t+1}) } \end{aligned} \tag{7}$
Since $x_{t+1}$ is given, $\hat{p} (y|x_{t+1})$ does not depend on $x_t$, so $1 / \hat{p} (y|x_{t+1})$ can be treated as a normalizing constant $Z$. Eq. (7) then becomes
$\hat{p} (x_t| x_{t+1}, y) = Z \hat{p} (y|x_t) p(x_t | x_{t+1}) \tag{8}$
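
Eqs. (6) and (7) can be sanity-checked numerically on a tiny discrete chain. The sketch below is only an illustration under assumed toy settings: the state count `K`, the label count `C`, and the random tables are made up, and the exact reverse conditionals of the toy chain stand in for $\hat{p}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 4, 3  # toy numbers of discrete "image" states and class labels


def random_conditional(rows, cols):
    """Return a (rows, cols) matrix whose rows are probability vectors."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)


q_x0 = random_conditional(1, K)[0]   # q(x0)
q_y_x0 = random_conditional(K, C)    # q_hat(y | x0), indexed [x0, y]
q_x1_x0 = random_conditional(K, K)   # q(x1 | x0),    indexed [x0, x1]
q_x2_x1 = random_conditional(K, K)   # q(x2 | x1),    indexed [x1, x2]

# Joint q_hat(x0, y, x1, x2): the noising steps ignore y, as in Eq. (1).
joint = np.einsum("a,ay,ab,bc->aybc", q_x0, q_y_x0, q_x1_x0, q_x2_x1)

p_y_x1_x2 = joint.sum(axis=0)        # q_hat(y, x1, x2), indexed [y, x1, x2]
p_x1_x2 = p_y_x1_x2.sum(axis=0)      # q_hat(x1, x2)
p_y_x1 = p_y_x1_x2.sum(axis=2)       # q_hat(y, x1)
p_y_x2 = p_y_x1_x2.sum(axis=1)       # q_hat(y, x2)

cls_x1 = p_y_x1 / p_y_x1.sum(axis=0)   # q_hat(y | x1), indexed [y, x1]
cls_x2 = p_y_x2 / p_y_x2.sum(axis=0)   # q_hat(y | x2), indexed [y, x2]
rev = p_x1_x2 / p_x1_x2.sum(axis=0)    # q_hat(x1 | x2), indexed [x1, x2]

# Eq. (6): q_hat(y | x1, x2) does not depend on x2.
cls_x1_x2 = p_y_x1_x2 / p_x1_x2
eq6_holds = np.allclose(cls_x1_x2, cls_x1[:, :, None])

# Eq. (7): q_hat(x1 | x2, y) = q_hat(y | x1) q_hat(x1 | x2) / q_hat(y | x2).
lhs = p_y_x1_x2 / p_y_x2[:, None, :]
rhs = cls_x1[:, :, None] * rev[None, :, :] / cls_x2[:, None, :]
eq7_holds = np.allclose(lhs, rhs)
```

Both flags should come out true for any choice of tables, because the label only enters through $x_0$ and the noising steps never look at it.
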
The second factor on the right-hand side of Eq. (8), $\hat{p} (y|x_t)$, is easy to obtain: we can train a classifier $\hat{p}_\phi(y|x_t)$ on pairs $(x_t, y)$.

The third factor, $p(x_t | x_{t+1})$, is what DDPM already estimates with a neural network: $p(x_t | x_{t+1}) \approx p_\theta(x_t|x_{t+1})$

The sampling distribution is therefore
$\begin{aligned} \hat{p} (x_t| x_{t+1}, y) &\approx \hat{p}_{\phi, \theta} (x_t| x_{t+1}, y) \\ &= Z \hat{p}_{\phi} (y|x_t) p_{\theta}(x_t | x_{t+1}) \end{aligned} \tag{9}$
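Next, let us see how to sample from Eq. (9).

Sampling from this distribution exactly is intractable. Following reference¹, the paper approximates it as a perturbed Gaussian distribution.

From the DM derivation, $p_{\theta}(x_t | x_{t+1}) = \mathcal{N}(\mu, \Sigma) \propto \exp \left( - \frac{1}{2} (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) \right)$. Taking the logarithm gives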
$\log p_{\theta}(x_t|x_{t+1}) = - \frac{1}{2} (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) + C \tag{10}$
For $\log \hat{p}_{\phi} (y|x_t)$, the authors assume that its curvature is low compared with $\Sigma^{-1}$. This is reasonable: as the number of diffusion steps grows, $\parallel \Sigma \parallel \rightarrow 0$. Under this assumption, we Taylor-expand $\log\hat{p}_{\phi} (y|x_t)$ to first order around $x_t = \mu$:
$\begin{aligned} \log \hat{p}_{\phi} (y|x_t) & \approx \log \hat{p}_{\phi} (y|x_t) | _{x_t = \mu} + (x_t - \mu)^T \nabla_{x_t} \log \hat{p}_{\phi} (y|x_t)|_{x_t = \mu} \\ &= (x_t - \mu)^T g + C_1 \\ \text{where: } g &= \nabla_{x_t} \log \hat{p}_{\phi} (y|x_t)|_{x_t = \mu}, C_1\text{ is a constant.} \end{aligned} \tag{11}$

$\begin{aligned} \log (\hat{p}_{\phi} (y|x_t) p_{\theta}(x_t | x_{t+1})) & = - \frac{1}{2} (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) + (x_t - \mu)^T g + C_2 \\ & = - \frac{1}{2} (x_t - \mu - \Sigma g)^T \Sigma^{-1} (x_t - \mu- \Sigma g) + \frac{1}{2}g^T\Sigma g + C_2 \\ & = - \frac{1}{2} (x_t - \mu - \Sigma g)^T \Sigma^{-1} (x_t - \mu- \Sigma g) + C_3 \\ & = \log p(z) + C_4, z \sim \mathcal{N}(\mu + \Sigma g, \Sigma) \end{aligned} \tag{12}$

(The appendix verifies this step.)
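
The completing-the-square step in Eq. (12) can also be checked numerically. The following is a minimal sketch with arbitrary random values for $\mu$, $\Sigma$, and $g$ (all assumed for illustration): the two log-densities should differ only by the constant $\frac{1}{2} g^T \Sigma g$, whatever $x_t$ is.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)   # symmetric positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)
mu = rng.normal(size=d)
g = rng.normal(size=d)            # stands in for the classifier gradient at mu


def log_product(x):
    """Unnormalized log of N(x; mu, Sigma) * exp((x - mu)^T g)."""
    r = x - mu
    return -0.5 * r @ Sigma_inv @ r + r @ g


def log_shifted_gaussian(x):
    """Unnormalized log of N(x; mu + Sigma g, Sigma)."""
    r = x - mu - Sigma @ g
    return -0.5 * r @ Sigma_inv @ r


xs = rng.normal(size=(8, d))      # arbitrary test points
gaps = np.array([log_product(x) - log_shifted_gaussian(x) for x in xs])
expected_gap = 0.5 * g @ Sigma @ g
```

Every entry of `gaps` should match `expected_gap`, which is exactly the $x_t$-independent term absorbed into $C_3$.
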

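This derivation shows that the class-conditional sampling step can also be approximated by a Gaussian; only the mean is shifted by $\Sigma g$. The resulting algorithm is shown in the figure below.
[Figure: classifier-guided sampling algorithm]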

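Code implementation

p_mean_var_ddpm computes the mean and variance of the Gaussian used by DDPM.

p_mean_var_ddpm_with_classifier computes the mean and variance of the Gaussian once class guidance is added.

With the mean and variance in hand, sampling is straightforward.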

import torch
import torch.nn.functional as F


def p_mean_var_ddpm(self, noise_model, x, t):
    r"""
    Mean and variance of the DDPM reverse step.

    Math:
    \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} x_t -
        \frac{1 - \alpha_t }{\sqrt{\alpha_t}\sqrt{1 - \overline{\alpha}_t}}f_\theta(x_t, t)
    """
    betas_t = extract(self.betas, t, x.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        self.sqrt_one_minus_alphas_cumprod, t, x.shape
    )
    sqrt_recip_alphas_t = extract(self.sqrt_recip_alphas, t, x.shape)
    model_mean_t = sqrt_recip_alphas_t * (
        x - betas_t * noise_model(x, t) / sqrt_one_minus_alphas_cumprod_t
    )
    posterior_variance_t = extract(self.posterior_variance, t, x.shape)
    return model_mean_t, posterior_variance_t

  
def p_mean_var_ddpm_with_classifier(self, classifier, noise_model, x, t, y=None, cfs=1):
    def cond_fn(x: torch.Tensor, t: torch.Tensor, y: torch.Tensor):
        # Gradient of log p_phi(y | x_t) with respect to x_t.
        # Note: it is taken at x, not at the mean mu used in Eq. (11).
        assert y is not None
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            logits = classifier(x_in, t)
            log_probs = F.log_softmax(logits, dim=-1)
            selected = log_probs[range(len(logits)), y.view(-1)]
            return torch.autograd.grad(selected.sum(), x_in)[0].float()

    grad = cond_fn(x, t, y=y) * cfs  # cfs is the gradient scale s
    model_mean_t, posterior_variance_t = self.p_mean_var_ddpm(noise_model, x, t)
    # Shift the mean by s * Sigma * g, following Eq. (12).
    new_mean = model_mean_t + posterior_variance_t * grad
    return new_mean, posterior_variance_t
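
For completeness, here is a rough NumPy sketch of the last step, drawing $x_{t-1}$ from the returned mean and variance. The function name `sample_from_mean_var` is hypothetical, and the convention that step index 0 is the final, noise-free step is an assumption borrowed from common DDPM implementations rather than something stated above.

```python
import numpy as np


def sample_from_mean_var(mean, variance, t, rng):
    """Draw x_{t-1} ~ N(mean, variance * I).

    Assumption: t == 0 is the last denoising step, where no noise is added
    and the mean itself is returned.
    """
    if t == 0:
        return mean
    noise = rng.standard_normal(mean.shape)
    return mean + np.sqrt(variance) * noise


rng = np.random.default_rng(0)
mean = np.zeros((2, 3))                    # stand-in for the (guided) mean
x_prev = sample_from_mean_var(mean, 0.25, t=5, rng=rng)
x_last = sample_from_mean_var(mean, 0.25, t=0, rng=rng)
```

The same function works for both the plain and the guided case, since guidance only changes the mean that is passed in.
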

Conditional denoising process in DDIM

The conditional sampling derivation above only holds for stochastic diffusion sampling and does not apply to deterministic samplers such as DDIM² (DDIM sets the variance to 0, so Eq. (12) cannot be derived). Instead, the authors take a score-based approach, following Song et al.³ and using the connection between diffusion models and score matching.

Start from Bayes' rule:
$\begin{aligned} p (x_t| y) & = \frac{p (y|x_t) p(x_t) }{p (y) } \\ \Rightarrow \log{p (x_t| y) } &= \log{p (y|x_t)} + \log{p(x_t)} - \log{p (y) } \\ \stackrel{\nabla_{x_t}}{\Rightarrow} \nabla_{x_t}\log{p (x_t|y)} &= \nabla_{x_t}\log{p (y|x_t)} + \nabla_{x_t}\log{p(x_t)} - \underbrace{\nabla_{x_t}\log{p(y) }}_{=0} \\ \Rightarrow \nabla_{x_t}\log{p(x_t| y)} &= \nabla_{x_t}\log{p(y|x_t)} + \nabla_{x_t}\log{p(x_t)} \\ \end{aligned} \tag{13}$
Specifically, if we have a model $\epsilon_\theta(x_t)$ that predicts the noise added to a sample, it can be used to derive a score function:
$\nabla_{x_t} \log p_\theta (x_t) = - \frac{1}{\sqrt{1 - \overline{\alpha}_t}} \epsilon_\theta(x_t) \tag{14}$
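
As a rough sanity check on Eq. (14), consider a single known $x_0$, so that the true noise plays the role of $\epsilon_\theta$ and the density is $q(x_t|x_0) = \mathcal{N}(\sqrt{\overline{\alpha}_t} x_0, (1 - \overline{\alpha}_t)\textbf{I})$. This is a simplification of the marginal case, and the value of `alpha_bar` is arbitrary. A finite-difference score should match $-\epsilon / \sqrt{1 - \overline{\alpha}_t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.3                      # assumed cumulative product alpha_bar_t
x0 = rng.normal(size=4)
eps = rng.standard_normal(4)         # the noise actually added
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps


def log_q(x):
    """log q(x_t | x0) up to an additive constant."""
    r = x - np.sqrt(alpha_bar) * x0
    return -0.5 * (r @ r) / (1 - alpha_bar)


# Central finite differences along each coordinate axis.
h = 1e-5
score_fd = np.array(
    [(log_q(x_t + h * e) - log_q(x_t - h * e)) / (2 * h) for e in np.eye(4)]
)
score_from_eps = -eps / np.sqrt(1 - alpha_bar)
```
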
Substituting into Eq. (13) gives
$\begin{aligned} \nabla_{x_t}\log{p(x_t| y)} &= \nabla_{x_t}\log{p(y|x_t)} - \frac{1}{\sqrt{1 - \overline{\alpha}_t}} \epsilon_\theta(x_t) \\ \Rightarrow \sqrt{1 - \overline{\alpha}_t} \nabla_{x_t}\log{p(x_t| y)} &= \sqrt{1 - \overline{\alpha}_t} \nabla_{x_t}\log{p(y|x_t)} - \epsilon_\theta(x_t) \end{aligned} \tag{15}$
Define the estimated noise under condition $y$, $\hat{\epsilon}(x_t|y)$, as:
$\hat{\epsilon}(x_t|y) := \epsilon_\theta(x_t) - \sqrt{1 - \overline{\alpha}_t}\nabla_{x_t} \log{p_\phi(y|x_t)} \tag{16}$
Replacing $\epsilon_\theta(x_t)$ with $\hat{\epsilon}(x_t|y)$ in DDIM yields the conditional denoising process.


The code is just as direct.

def p_sample_ddim(self, model, x, t):
    r"""
    Deterministic DDIM step (variance set to 0).

    x_{t-1} = \sqrt{\overline{\alpha}_{t-1}} \frac{x_t - \sqrt{1 - \overline{\alpha}_{t}}\boldsymbol{\epsilon}_\theta(x_t, t)}
        {\sqrt{\overline{\alpha}_{t}}} +  \sqrt{1 - \overline{\alpha}_{t-1} } \boldsymbol{\epsilon}_\theta(x_t, t)
    """
    sqrt_alphas_cumprod_prev_t = extract(self.sqrt_alphas_cumprod_prev, t, x.shape) 
    sqrt_one_minus_alphas_cumprod_t = extract(self.sqrt_one_minus_alphas_cumprod, t, x.shape)
    sqrt_one_minus_alphas_cumprod_prev_t = extract(self.sqrt_one_minus_alphas_cumprod_prev, t, x.shape) 
    sqrt_alphas_cumprod_t = extract(self.sqrt_alphas_cumprod, t, x.shape) 
    pred_noise = model(x, t)
    # sqrt(alpha_bar_{t-1}) times the predicted x_0
    pred_x0 = sqrt_alphas_cumprod_prev_t * (x - sqrt_one_minus_alphas_cumprod_t * pred_noise) / sqrt_alphas_cumprod_t
    # sqrt(1 - alpha_bar_{t-1}) times the predicted noise
    x0_direction = sqrt_one_minus_alphas_cumprod_prev_t * pred_noise
    return pred_x0 + x0_direction
  
  
def p_sample_with_classifier(self, model, x, t, t_index, y=None, **kwargs):
    if y is None:
        # No label: fall back to the unconditional DDIM step
        # (p_sample_ddim does not take t_index).
        return self.p_sample_ddim(model, x, t)
    cfs = kwargs.get("cfs", 1) 
    sqrt_alphas_cumprod_prev_t = extract(self.sqrt_alphas_cumprod_prev, t, x.shape) 
    sqrt_one_minus_alphas_cumprod_t = extract(self.sqrt_one_minus_alphas_cumprod, t, x.shape)
    sqrt_one_minus_alphas_cumprod_prev_t = extract(self.sqrt_one_minus_alphas_cumprod_prev, t, x.shape) 
    sqrt_alphas_cumprod_t = extract(self.sqrt_alphas_cumprod, t, x.shape) 
    pred_noise = model(x, t)
    # cond_fn returns the classifier gradient of log p_phi(y | x_t); cfs is the gradient scale s
    score = self.cond_fn(x, t, y=y) * cfs
    pred_noise = pred_noise - sqrt_one_minus_alphas_cumprod_t * score  # Eq. (16)
    pred_x0 = sqrt_alphas_cumprod_prev_t * (x - sqrt_one_minus_alphas_cumprod_t * pred_noise) / sqrt_alphas_cumprod_t
    x0_direction = sqrt_one_minus_alphas_cumprod_prev_t * pred_noise 
    return pred_x0 + x0_direction
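
A stripped-down NumPy version of the same update may make the structure easier to see. It is only a sketch: `ddim_step` and `guided_eps` are hypothetical helper names, the $\overline{\alpha}$ values are arbitrary, and the true noise is used in place of a trained network.

```python
import numpy as np


def ddim_step(x_t, eps, abar_t, abar_prev):
    """Deterministic DDIM update (variance 0) from x_t to x_{t-1}."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps


def guided_eps(eps, class_grad, abar_t, scale=1.0):
    """Eq. (16) with an extra gradient scale applied to the classifier gradient."""
    return eps - np.sqrt(1 - abar_t) * scale * class_grad


rng = np.random.default_rng(0)
x0 = rng.normal(size=3)
eps = rng.standard_normal(3)
abar_t, abar_prev = 0.4, 0.7          # assumed schedule values, abar_prev > abar_t
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps

# With the exact noise, one step lands on the same trajectory at t-1.
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
x_prev_expected = np.sqrt(abar_prev) * x0 + np.sqrt(1 - abar_prev) * eps

# A zero classifier gradient leaves the noise estimate unchanged.
unguided = guided_eps(eps, np.zeros(3), abar_t, scale=5.0)
```

Guidance enters only through the noise estimate, so the sampler itself does not need to change.
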

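Some details

Training the classifier

The classifier can be trained independently of the diffusion model. One option is to use the encoder part of the noise-prediction model (the UNet) as the backbone and attach a classification layer after it. The classifier must be trained on inputs noised with the same noise distribution as the corresponding diffusion model. The training set has the form $[(x_1^t, t, y_1), (x_2^t, t, y_2), \ldots, (x_N^t, t, y_N)]$, where $t$ is a sampled timestep and $x^t$ is $x$ noised to timestep $t$. After training, the classifier is plugged into the sampling process using the algorithm above.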
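
Building such $(x^t, t, y)$ tuples could look roughly like the sketch below. The helper name `make_noisy_batch`, the linear beta schedule, and the random stand-in data are all assumptions made for illustration; a real setup would reuse the diffusion model's own schedule and dataset.

```python
import numpy as np


def make_noisy_batch(x0, y, alphas_cumprod, rng):
    """Build (x_t, t, y) classifier training tuples from clean samples x0."""
    n = x0.shape[0]
    num_steps = alphas_cumprod.shape[0]
    t = rng.integers(0, num_steps, size=n)   # one timestep per sample
    # Reshape alpha_bar_t so it broadcasts over the non-batch dimensions.
    abar = alphas_cumprod[t].reshape(n, *([1] * (x0.ndim - 1)))
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * noise   # sample q(x_t | x_0)
    return x_t, t, y


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
x0 = rng.normal(size=(8, 3, 4, 4))           # stand-in "images"
y = rng.integers(0, 10, size=8)              # stand-in class labels
x_t, t, y_out = make_noisy_batch(x0, y, alphas_cumprod, rng)
```

The classifier then takes `x_t` and `t` as input and is trained with an ordinary cross-entropy loss against `y_out`.
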

The role of the gradient scale

The sampling algorithm above includes a gradient scale $s$ that stretches the gradient.

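Empirical view

Roughly speaking, with $s = 1$ only about 50% of the generated images belong to the desired class⁴, and this fraction grows as $s$ increases. As the figure below shows, once $s$ is raised to 10 the generated images all belong to the desired class. For this reason $s$ is also called the guidance scale.
[Figure: generated samples at different values of $s$]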

There is another way to understand this scale.

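$s\nabla_{x_t} \log (p_\phi(y|x_t)) = \nabla_{x_t} \log (p_\phi(y|x_t)^s)$. When $s > 1$, this amounts to raising the distribution $p_\phi(y|x_t)$ to a power, which sharpens it and gives a larger gradient step toward the target class.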
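
The sharpening effect of the exponent is easy to see on a toy class distribution. In this small sketch the probability vector is made up, and the renormalization by the sum is added so that the result is again a distribution.

```python
import numpy as np


def sharpen(p, s):
    """Raise a probability vector to the power s and renormalize it."""
    w = p ** s
    return w / w.sum()


p = np.array([0.5, 0.3, 0.2])   # made-up classifier output over three classes
sharp = {s: sharpen(p, s) for s in (1.0, 2.0, 10.0)}
top_class_prob = {s: float(v[0]) for s, v in sharp.items()}
```

As $s$ grows, `top_class_prob` moves toward 1: the scaled classifier concentrates its mass on its most likely class, which is why larger scales push samples more firmly toward that class.
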

According to the DM sampling procedure, without classifier guidance the sampling step at time $t$ is
$\begin{aligned} x_{t-1} &= \mu_{\theta}(x_t, t) + \sigma(t) \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \textbf{I}) \\ & = \underbrace{\frac{1}{\sqrt{\alpha_t}} (x_t - \frac{1 - \alpha_t }{\sqrt{1 - \overline{\alpha}_t}}\epsilon_\theta(x_t, t))}_{\mu_\theta(x_t, t)} + \sigma(t) \epsilon \end{aligned} \tag{17}$
With classifier guidance, $\mu_{\theta}(x_t, t)$ is moved a small step in the direction that increases the predicted probability of class $y$, and $s$ controls the size of that step. Consistent with Eq. (12), the step is scaled by $\Sigma = \sigma^2(t)\textbf{I}$:
$x_{t-1} = \mu_{\theta}(x_t, t) + s \Sigma \nabla_{x_t} \log p_{\phi} (y|x_t)|_{x_t = \mu_{\theta}(x_t, t)} + \sigma(t) \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \textbf{I}) \tag{18}$
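
To illustrate how the scale moves the mean, here is a toy sketch with a two-class logistic classifier whose gradient is known in closed form. The weights `w`, the mean `mu`, and the variance are invented values, and the update multiplies the gradient by the variance, as in the code implementation above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


w = np.array([2.0, -1.0])   # hypothetical linear classifier weights


def prob_y1(x):
    """p(y=1 | x) for a logistic classifier."""
    return sigmoid(w @ x)


def grad_log_prob_y1(x):
    """Analytic gradient of log sigmoid(w.x), equal to (1 - sigmoid(w.x)) * w."""
    return (1.0 - sigmoid(w @ x)) * w


mu = np.array([0.1, 0.2])   # stand-in for mu_theta(x_t, t)
variance = 0.05             # stand-in for the diagonal of Sigma
probs = {}
for s in (0.0, 1.0, 10.0):
    guided_mean = mu + s * variance * grad_log_prob_y1(mu)
    probs[s] = float(prob_y1(guided_mean))
```

The classifier probability at the guided mean rises with $s$, which matches the empirical observation that larger scales produce more samples of the requested class.
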

References

Appendix

Verification of Eq. (12)
$\begin{aligned} &- \frac{1}{2} (x_t - \mu - \Sigma g)^T \Sigma^{-1} (x_t - \mu- \Sigma g) + \frac{1}{2}g^T\Sigma g + C_2 \\ = &- \frac{1}{2} (x_t^T - \mu^T - g^T \Sigma^T) \Sigma^{-1} (x_t - \mu - \Sigma g) + \frac{1}{2}g^T\Sigma g + C_2 \\ = & - \frac{1}{2} (x_t^T \Sigma^{-1} - \mu^T \Sigma^{-1} - \underbrace{g^T \Sigma^T \Sigma^{-1}}_{g^T} )(x_t - \mu - \Sigma g) + \frac{1}{2}g^T\Sigma g + C_2 \\ = & - \frac{1}{2} (x_t^T \Sigma^{-1} (x_t - \mu - \Sigma g) - \mu^T \Sigma^{-1} (x_t - \mu - \Sigma g) - g^T (x_t - \mu - \Sigma g)) + \frac{1}{2}g^T\Sigma g + C_2 \\ = & - \frac{1}{2} \underbrace{(x_t^T \Sigma^{-1} (x_t - \mu ) - \mu^T \Sigma^{-1} (x_t - \mu))}_{(x_t - \mu)^T \Sigma^{-1} (x_t - \mu)} - \frac{1}{2} ( - g^T (x_t - \mu - \Sigma g) + \underbrace{(- x_t^T \Sigma^{-1}\Sigma g)}_{-x_t^Tg} + \underbrace{\mu^T \Sigma^{-1}\Sigma g}_{\mu^Tg}) + \frac{1}{2}g^T\Sigma g + C_2 \\ = & - \frac{1}{2} (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) + (x_t - \mu)^T g + C_2 \end{aligned}$

¹ Deep unsupervised learning using nonequilibrium thermodynamics
² [Denoising Diffusion Implicit Models (DDIM) Sampling](https://arxiv.org/abs/2010.02502)
³ Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020.
⁴ Diffusion Models Beat GANs on Image Synthesis