[Paper Reading] FreeU: Free Lunch in Diffusion U-Net (Improving Generation Quality)

news2024/11/24 10:53:04


In this paper, we uncover the untapped potential of diffusion U-Net, which serves as a “free lunch” that substantially improves the generation quality on the fly.
We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics.
Capitalizing on this discovery, we propose a simple yet effective method—termed “FreeU” — that enhances generation quality without additional training or finetuning.
Our key insight is to strategically re-weight the contributions sourced from the U-Net’s skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture.
Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated into existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference.





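Note: the practical point is that FreeU needs no retraining. It drops into an existing diffusion model (Stable Diffusion, DreamBooth, ModelScope, Rerender, ReVersion) with a few lines of code, and the only things to tune at inference time are the two scaling factors.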




Beyond the application of diffusion models, in this paper, we are interested in investigating the effectiveness of diffusion U-Net for the denoising process.
To better understand the denoising process, we first present a paradigm shift toward the Fourier domain to view the generation process of diffusion models, a research area that has received limited prior investigation.
As illustrated in Fig. 2, the uppermost row provides the progressive denoising process, showcasing the generated images across successive iterations.
The subsequent two rows exhibit the associated low-frequency and high-frequency spatial domain information after the inverse Fourier Transform, aligning with each respective step.
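The low-frequency and high-frequency rows described above can be reproduced in spirit with a simple Fourier mask. The following is a minimal NumPy sketch, not the paper's code: the function name `split_frequencies`, the circular mask, and the cutoff `radius` are assumptions made for illustration, and a synthetic noisy sine pattern stands in for a real intermediate denoising result.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a 2D grayscale image into low- and high-frequency parts.

    `radius` (hypothetical parameter) is the cutoff in frequency bins,
    measured from the centre of the shifted spectrum.
    """
    h, w = image.shape
    # Move the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    # Inverse transform of each band gives its spatial-domain image.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Synthetic stand-in for a partially denoised image:
# a smooth global pattern plus random noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[:64, :64]
smooth = np.sin(2 * np.pi * xx / 64)
noisy = smooth + 0.5 * rng.standard_normal((64, 64))

low, high = split_frequencies(noisy, radius=4)
```

Because the two masks are complementary, `low + high` recovers the input. The low band keeps the global layout (here, the sine pattern), while most of the noise ends up in the high band.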

Fig. 2 shows that the low-frequency components are modulated gradually, with a subdued rate of change, while the high-frequency components display more pronounced dynamics throughout the denoising process. These findings are further corroborated in Fig. 3. This can be intuitively explained: 1) Low-frequency components inherently embody the global structure and characteristics of an image, encompassing global layouts and smooth color. These components encapsulate the foundational global elements that constitute the image’s essence and representation. Rapid alterations to them are generally unreasonable in denoising processes: drastic changes to these components could fundamentally reshape the image’s essence, an outcome typically incompatible with the objectives of denoising. 2) Conversely, high-frequency components contain the rapid changes in the image, such as edges and textures. These finer details are markedly sensitive to noise, often manifesting as random high-frequency information when noise is introduced to an image. Consequently, denoising processes need to expunge noise while upholding indispensable intricate details.

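1) Low-frequency components inherently carry the global structure and characteristics of an image, including the global layout and smooth color.
2) High-frequency components, by contrast, contain the rapid changes in an image, such as edges and textures. These finer details are very sensitive to noise: when noise is added to an image, it usually shows up as random high-frequency information.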


Figure 3. Relative log amplitudes of the Fourier spectrum for different values of the backbone scaling factor b. Increasing b correspondingly suppresses the high-frequency components in the images generated by the diffusion model.
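A curve like the one in Figure 3 can be approximated by radially averaging the amplitude spectrum and taking its log relative to the lowest-frequency bin. The sketch below is an assumption about how such a profile might be computed, not the paper's measurement code: `relative_log_amplitude` and `n_bins` are hypothetical names, and a simple 2x2 averaging blur stands in for "an image with suppressed high frequencies", since no diffusion model is run here.

```python
import numpy as np

def relative_log_amplitude(image, n_bins=16):
    """Radially averaged log amplitude, relative to the lowest-frequency bin."""
    h, w = image.shape
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    max_r = min(h, w) / 2
    valid = dist < max_r  # ignore the corners beyond the largest full ring
    idx = (dist[valid] / max_r * n_bins).astype(int)
    sums = np.bincount(idx, weights=amplitude[valid], minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    profile = np.log(sums / counts + 1e-12)
    return profile - profile[0]

rng = np.random.default_rng(0)
original = rng.standard_normal((64, 64))
# Stand-in for a smoother output: average each pixel with three neighbours.
blurred = (original
           + np.roll(original, 1, axis=0)
           + np.roll(original, 1, axis=1)
           + np.roll(original, (1, 1), axis=(0, 1))) / 4

original_profile = relative_log_amplitude(original)
blurred_profile = relative_log_amplitude(blurred)
```

The last entries of each profile correspond to the highest frequencies. For the blurred image they sit below those of the original, which is the same kind of drop the figure reports as b increases.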

In light of these observations between low-frequency and high-frequency components during the denoising process, we extend our investigation to ascertain the specific contributions of the U-Net architecture within the diffusion framework.
In each stage of the U-Net decoder, the skip features from the skip connection and the backbone features are concatenated together.
Our investigation reveals that the main backbone of the U-Net primarily contributes to denoising. Conversely, the skip connections are observed to introduce high-frequency features into the decoder module.
These connections propagate fine-grained semantic information to make it easier to recover the input data.
However, an unintended consequence of this propagation is the potential weakening of the backbone’s inherent denoising capabilities during the inference phase.
This can lead to the generation of abnormal image details, as illustrated in the first row of Fig. 1.



Figure 4. FreeU Framework. (a) U-Net Skip Features and Backbone Features. In U-Net, the skip features and backbone features are concatenated together at each decoding stage. We apply the FreeU operations during concatenation. (b) FreeU Operations. The factor b aims to amplify the backbone feature map x, while factor s is designed to attenuate the skip feature map h.
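Note on (a): at every decoding stage of the U-Net, the skip features and the backbone features are concatenated, and the FreeU operations are applied at this concatenation step.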
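The re-weighting in Figure 4 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the excerpt above only says that b amplifies the backbone feature map x and s attenuates the skip feature map h. Scaling only the first half of the backbone channels, and applying s only to a small low-frequency window of the skip features in the Fourier domain, are details recalled from the paper's method description and should be treated as assumptions here. The names `fourier_filter`, `freeu_concat`, and `threshold` are hypothetical.

```python
import numpy as np

def fourier_filter(h, threshold, scale):
    """Scale the low-frequency band of skip features h with shape (B, C, H, W)."""
    freq = np.fft.fftshift(np.fft.fft2(h, axes=(-2, -1)), axes=(-2, -1))
    height, width = h.shape[-2:]
    crow, ccol = height // 2, width // 2
    mask = np.ones((height, width))
    # Assumption: a symmetric window around the zero frequency, so the
    # filtered result stays real-valued.
    mask[crow - threshold:crow + threshold + 1,
         ccol - threshold:ccol + threshold + 1] = scale
    filtered = np.fft.ifft2(np.fft.ifftshift(freq * mask, axes=(-2, -1)),
                            axes=(-2, -1))
    return filtered.real

def freeu_concat(x, h, b, s, threshold=1):
    """Re-weight backbone features x and skip features h, then concatenate."""
    x = x.copy()
    half = x.shape[1] // 2
    x[:, :half] *= b                      # amplify backbone (assumed: half the channels)
    h = fourier_filter(h, threshold, s)   # attenuate skip low frequencies
    return np.concatenate([x, h], axis=1) # channel-wise concat, as in the decoder

# Toy feature maps: constant inputs make the effect of b and s easy to see.
x = np.ones((1, 4, 8, 8))   # backbone features
h = np.ones((1, 2, 8, 8))   # skip features
out = freeu_concat(x, h, b=1.2, s=0.9)
```

With constant inputs, the first two backbone channels become 1.2, the other two stay at 1.0, and the skip channels become 0.9, because a constant map contains only the zero-frequency component. The values b=1.2 and s=0.9 are placeholders chosen for the demo, not recommended settings.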




