Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models

news2025/1/12 8:50:01

Original link: Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models

High-Resolution Image Synthesis with Latent Diffusion Models

    • 01 The shortcomings of the existing works?
    • 02 What problem is addressed?
    • 03 What are the keys to the solutions?
    • 04 What are the main contributions?
    • 05 Related works?
    • 06 Method descriptions
      • Perceptual Image Compression
      • Latent Diffusion Models
      • Conditioning Mechanisms
    • 07 Results and Comparisons
    • 08 Ablation studies
      • On Perceptual Compression Tradeoffs
    • 09 How this work can be improved?
    • 10 Conclusions

01 The shortcomings of the existing works?

  • Since diffusion models typically operate directly in pixel space, optimizing powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations.

02 What problem is addressed?

  • Reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
  • LDMs achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

03 What are the keys to the solutions?

  • By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.
  • First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space.
  • For conditional tasks such as text-to-image synthesis, we design an architecture that connects transformers to the DM’s UNet backbone [69] and enables arbitrary types of token-based conditioning mechanisms.

04 What are the main contributions?

  • In contrast to purely transformer-based approaches [23, 64], our method scales more gracefully to higher-dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work, and (b) be efficiently applied to high-resolution synthesis of megapixel images.
  • We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs.
  • Our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
  • We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of ∼1024² px.
  • We design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models.
  • We release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion, which might be reusable for various tasks besides training of DMs.

05 Related works?

  • Generative Models for Image Synthesis
  • Diffusion Probabilistic Models (DM)
  • Two-Stage Image Synthesis

06 Method descriptions

We utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.

Such an approach offers several advantages:

  • By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space.
  • We exploit the inductive bias of DMs inherited from their UNet architecture [69], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 64].
  • Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].

Perceptual Image Compression

Given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h \times w \times c}$.
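To make the shapes concrete, here is a minimal sketch of how the latent size follows from the image size. It assumes the encoder downsamples both spatial dimensions by the same factor f = H/h = W/w (the downsampling factor studied in the ablation section); the function names and the choice of c = 3 latent channels are illustrative, not taken from the released code.

```python
def latent_shape(height, width, f, channels):
    """Shape (h, w, c) of z = E(x) for an H x W x 3 image and downsampling factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, channels)


def compression_ratio(height, width, f, channels):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * 3) / (h * w * c)


# Illustrative setting: a 256 x 256 image, f = 4, c = 3 latent channels.
shape = latent_shape(256, 256, f=4, channels=3)
ratio = compression_ratio(256, 256, f=4, channels=3)
print(shape, ratio)  # (64, 64, 3) 16.0
```

With these assumed values the diffusion model operates on 16 times fewer values per sample than a pixel-space model would.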

Latent Diffusion Models

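The corresponding objective in DM can be simplified to:

$$L_{DM} = \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 \right]$$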

Compared to the high-dimensional pixel space, latent space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.

Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
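In the paper's notation, the latent-space objective is the same noise-prediction loss with $z_t$ in place of $x_t$: $L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right]$. The sketch below shows one training-step evaluation of that loss under several assumptions that are not in the notes: a linear beta schedule, a flattened latent faked with random numbers instead of a real encoder, a placeholder `toy_denoiser` in place of the time-conditional UNet, and a mean (rather than a sum) over latent elements.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumed choice)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(z0, eps, alpha_bar_t):
    """Closed-form forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def toy_denoiser(z_t, t):
    """Hypothetical stand-in for eps_theta(z_t, t); a real model is a time-conditional UNet."""
    return [0.0 for _ in z_t]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]  # flattened latent z = E(x), faked here
eps = [rng.gauss(0.0, 1.0) for _ in z0]         # eps ~ N(0, 1)
t = rng.randrange(len(alpha_bars))              # uniformly sampled timestep
z_t = q_sample(z0, eps, alpha_bars[t])
loss = noise_prediction_loss(eps, toy_denoiser(z_t, t))
print(f"t={t}, loss={loss:.4f}")
```

Because `q_sample` is closed-form, the noisy latent for any timestep is obtained in one step from the encoded image, which is what "efficiently obtained during training" refers to.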

Conditioning Mechanisms

We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism [94], which is effective for learning attention-based models of various input modalities [34,35].

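To pre-process $y$ from various modalities (such as language prompts) we introduce a domain-specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, with:

$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y)$$

Here, $\varphi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_Q^{(i)}$, $W_K^{(i)}$, $W_V^{(i)}$ are learnable projection matrices.

See Fig. 3 for a visual depiction.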

Figure 3. We condition LDMs either via concatenation or by a more general cross-attention mechanism.

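Based on image-conditioning pairs, we then learn the conditional LDM via:

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \|_2^2 \right]$$

where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized.

This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [94] when $y$ are text prompts.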
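The cross-attention layer described in this section can be sketched in a few lines of plain Python. This is a single-head illustration under stated assumptions, not the released implementation: queries come from flattened UNet features (N positions), keys and values come from the encoded conditioning (M tokens), and the projections are applied with a row-vector convention (features times weight matrix). All matrices and sizes below are made-up toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(phi, tau, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d)) V with Q from UNet features and K, V from the conditioning."""
    q = matmul(phi, w_q)                      # (N, d) queries from image features
    k = matmul(tau, w_k)                      # (M, d) keys from conditioning tokens
    v = matmul(tau, w_v)                      # (M, d) values from conditioning tokens
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]      # (d, M)
    scores = matmul(q, k_t)                   # (N, M)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 spatial positions, feature dim 2
tau = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]     # M = 2 conditioning tokens, d_tau = 3
w_q = [[1.0, 0.0], [0.0, 1.0]]               # maps feature dim 2 to d = 2
w_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # maps d_tau = 3 to d = 2
w_v = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]   # maps d_tau = 3 to d = 2
out, weights = cross_attention(phi, tau, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 3 2
```

Each spatial position receives a weighted mix of the conditioning tokens, so the output keeps the spatial layout (N rows) regardless of how many tokens M the conditioning has.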

07 Results and Comparisons

Experimental findings:

  • LDMs trained in VQ-regularized latent spaces achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. Therefore, we evaluate VQ-regularized LDMs in the remainder of the paper, unless stated differently.

Table 8. Complete autoencoder zoo trained on OpenImages, evaluated on ImageNet-Val. † denotes an attention-free autoencoder.

Image Generation with Latent Diffusion

  • On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93] where a latent diffusion model is trained jointly together with the first stage.
  • We outperform prior diffusion-based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite utilizing half its parameters and requiring four times fewer training resources.
  • LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches.

Table 1. Evaluation metrics for unconditional image synthesis.

In Fig. 4 we also show qualitative results on each dataset.

Figure 4. Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-conditional ImageNet [12].

Conditional Latent Diffusion

Transformer Encoders for LDMs

For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [17, 66] and GAN-based [109] methods, cf. Tab. 2.

Table 2. Evaluation of text-conditional image synthesis on the 256 × 256-sized MS-COCO [51] dataset

Table 3. Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet [12]

Convolutional Sampling Beyond $256^2$

Figure 9. An LDM trained on $256^2$ resolution can generalize to larger resolution (here: 512×1024) for spatially conditioned tasks.

Super-Resolution with Latent Diffusion

Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance: LDM-SR outperforms SR3 in FID, while SR3 has a better IS.

Figure 10. ImageNet 64→256 super-resolution on ImageNet-Val.

Table 5. ×4 upscaling results on ImageNet-Val.

Further, we conduct a user study comparing the pixel-baseline with LDM-SR. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss.

Table 4. Task 1: Subjects were shown ground truth and generated image and asked for preference. Task 2: Subjects had to decide between two generated images.

Inpainting with Latent Diffusion

In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQLDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions.

Tab. 6 reports the training and sampling throughput at resolution $256^2$ and $512^2$, the total training time in hours per epoch and the FID score on the validation split after six epochs.

Table 6. Assessing inpainting efficiency.

The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality as measured by FID over that of [88].

Table 7. Comparison of inpainting performance on 30k crops of size 512 × 512 from test images of Places [108]

Figure 11. Qualitative results on object removal with our big, w/ ft inpainting model.

08 Ablation studies

On Perceptual Compression Tradeoffs

This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$.

Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section.

Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset.

Figure 6. Analyzing the training of class-conditional LDMs with different downsampling factors f over 2M train steps on the ImageNet dataset

In Fig. 7, we compare models trained on CelebAHQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID scores [29].

Figure 7. Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets.

Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.
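The speed-versus-quality comparison above varies the number of denoising steps of the DDIM sampler [84]. As a rough illustration, the sketch below implements the standard deterministic DDIM update (eta = 0) on a flattened latent, plus an evenly spaced timestep schedule. The `alpha_bar` notation (cumulative product of 1 - beta), the function names, and the stride rule are assumptions made for this example; the notes do not specify them.

```python
import math


def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from noise level alpha_bar_t to alpha_bar_prev."""
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean latent implied by the predicted noise.
    z0_pred = [(z - sqrt_one_minus_a_t * e) / sqrt_a_t for z, e in zip(z_t, eps_pred)]
    # Re-noise that estimate to the earlier, less noisy level.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * z0 + sqrt_one_minus_a_prev * e for z0, e in zip(z0_pred, eps_pred)]


def strided_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced subset of training timesteps, ordered from most to least noisy."""
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[:num_sample_steps][::-1]


# Sanity check with a perfect noise prediction: stepping to alpha_bar = 1 recovers z0.
z0 = [0.5, -1.0]
eps = [0.3, 0.7]
alpha_bar_t = 0.25
z_t = [math.sqrt(alpha_bar_t) * z + math.sqrt(1.0 - alpha_bar_t) * e for z, e in zip(z0, eps)]
recovered = ddim_step(z_t, eps, alpha_bar_t, 1.0)
print(recovered, len(strided_timesteps(1000, 50)))
```

Fewer sampling steps mean fewer UNet evaluations per image, and each evaluation is cheaper when the latent is smaller, which is the trade-off Fig. 7 plots against FID.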

09 How this work can be improved?

  • While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs.
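  • The use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in the f = 4 autoencoding models, their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space.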

10 Conclusions

  • We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality.
  • Based on this and our cross-attention conditioning mechanism, our experiments could demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures.

