Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

news2025/4/7 16:05:42

Code: https://github.com/phizaz/diffae
A CVPR 2022 (Oral) paper.

Diffusion probabilistic models (DPMs) have achieved remarkable quality in image generation that rivals GANs’. But unlike GANs, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding. Our key idea is to use a learnable encoder for discovering the high-level semantics, and a DPM as the decoder for modeling the remaining stochastic variations. Our method can encode any image into a two-part latent code where the first part is semantically meaningful and linear, and the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images. We also show that this two-level encoding improves denoising efficiency and naturally facilitates various downstream tasks including few-shot conditional sampling. Our novel latent space is more readily discriminative than StyleGAN-W (via inversion) when used to encode real input images.

Figure 1. Attribute manipulation and interpolation on real images.
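
The two operations named in Figure 1 act directly on the two-part code. The following is a minimal sketch, assuming linear interpolation for the semantic subcode, spherical interpolation for the stochastic subcode, and an attribute edit that moves the semantic subcode along a direction vector (for example, the weight vector of a linear attribute classifier). The function names and the `direction` input are hypothetical and are not taken from the diffae repository.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors (assumed for z_sem)."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]


def slerp(a, b, t):
    """Spherical linear interpolation between two vectors (assumed for x_T)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel vectors: slerp degenerates, fall back to lerp.
        return lerp(a, b, t)
    s = math.sin(theta)
    w_a = math.sin((1.0 - t) * theta) / s
    w_b = math.sin(t * theta) / s
    return [w_a * ai + w_b * bi for ai, bi in zip(a, b)]


def shift_attribute(z_sem, direction, scale):
    """Move z_sem by `scale` along the unit-normalised `direction` (hypothetical input)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [z + scale * d / norm for z, d in zip(z_sem, direction)]
```

In this sketch, an interpolated image would be obtained by decoding `(lerp(z1, z2, t), slerp(xT1, xT2, t))`, and an edited image by decoding `(shift_attribute(z, direction, scale), xT)` with the stochastic subcode left unchanged.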

Figure 2: Overview of our diffusion autoencoder.

The autoencoder consists of two components:

a “semantic” encoder that maps the input image to the semantic subcode ($\mathbf{x}_0 \to \mathbf{z}_{\mathrm{sem}}$), and
a conditional DDIM that acts both as a “stochastic” encoder ($\mathbf{x}_0 \to \mathbf{x}_T$) and as a decoder ($(\mathbf{z}_{\mathrm{sem}}, \mathbf{x}_T) \to \mathbf{x}_0$).

Here $\mathbf{z}_{\mathrm{sem}}$ captures the high-level semantics, while $\mathbf{x}_T$ (the stochastic subcode) captures the low-level stochastic variations; together they can be decoded back to the original image almost exactly.
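
Written out, both roles of the conditional DDIM follow the standard deterministic DDIM update. This is a sketch under assumed notation that does not appear in the text above: $\epsilon_\theta$ is the noise-prediction network, $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ is the cumulative noise schedule, $f_\theta$ is the implied prediction of the clean image, and $\mathrm{Enc}_\phi$ is the semantic encoder.

$$
\begin{aligned}
\mathbf{z}_{\mathrm{sem}} &= \mathrm{Enc}_\phi(\mathbf{x}_0) \\
f_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}}) &= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}})\right) \\
\text{encode:}\quad \mathbf{x}_{t+1} &= \sqrt{\alpha_{t+1}}\, f_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}}) + \sqrt{1-\alpha_{t+1}}\,\epsilon_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}}) \\
\text{decode:}\quad \mathbf{x}_{t-1} &= \sqrt{\alpha_{t-1}}\, f_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}}) + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(\mathbf{x}_t, t, \mathbf{z}_{\mathrm{sem}})
\end{aligned}
$$

Both directions reuse the same network and differ only in the direction of the step, which is why running the encoder up to $\mathbf{x}_T$ and then the decoder back down returns approximately to $\mathbf{x}_0$.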
To sample from the autoencoder (for unconditional image generation), we fit a latent DDIM to the distribution of $\mathbf{z}_{\mathrm{sem}}$ and sample $(\mathbf{z}_{\mathrm{sem}}, \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}))$ for decoding.
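
The sampling procedure can be sketched end to end with toy stand-ins. Everything model-related here is an assumption for illustration: `sample_z_sem` is a placeholder Gaussian draw rather than a trained latent DDIM, `toy_eps` is the closed-form noise predictor for unit-variance Gaussian "pixels" whose mean is a made-up function of the semantic subcode, and the linear beta schedule is the common DDPM default rather than a setting confirmed by the text. Only the control flow (draw the semantic subcode, draw Gaussian noise, decode deterministically) mirrors the description above.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [1.0], 1.0  # index 0 corresponds to the clean image
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def toy_eps(x_t, t, z_sem, alpha_bar):
    """Stand-in for eps_theta(x_t, t, z_sem): exact for pixels ~ N(mu(z_sem), 1)."""
    a_t = alpha_bar[t]
    mu = sum(z_sem) / len(z_sem)  # toy conditioning: mean of the semantic code
    return [math.sqrt(1.0 - a_t) * (xi - math.sqrt(a_t) * mu) for xi in x_t]


def sample_z_sem(dim, rng):
    """Placeholder for the latent DDIM fitted to z_sem (here: a Gaussian draw)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def ddim_decode(z_sem, x_T, alpha_bar, eps_fn=toy_eps):
    """Deterministic DDIM decoding (z_sem, x_T) -> x_0."""
    x = list(x_T)
    for t in range(len(alpha_bar) - 1, 0, -1):
        eps = eps_fn(x, t, z_sem, alpha_bar)
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        # Predicted clean image, then one deterministic step toward it.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * pi + math.sqrt(1.0 - a_prev) * ei
             for pi, ei in zip(x0_pred, eps)]
    return x


def sample_image(rng, z_dim=4, x_dim=8, num_steps=1000):
    """Draw z_sem and x_T ~ N(0, I), then decode them into a toy 'image'."""
    alpha_bar = make_alpha_bar(num_steps)
    z_sem = sample_z_sem(z_dim, rng)
    x_T = [rng.gauss(0.0, 1.0) for _ in range(x_dim)]
    return ddim_decode(z_sem, x_T, alpha_bar)


if __name__ == "__main__":
    print(sample_image(random.Random(0)))
```

Because the decoder is deterministic, all randomness in a sample comes from the two draws; fixing the semantic subcode and redrawing only the Gaussian noise varies the stochastic details while keeping the conditioning unchanged.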