Using adapters in diffusers: DreamBooth, Textual Inversion, LoRA and IP-Adapter

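Reference: https://huggingface.co/docs/diffusers/using-diffusers/loading_adapters

There are now many efficient training techniques for fine-tuning a diffusion model into a customized model that generates a specific object or a specific style. Each technique produces a different kind of adapter when training finishes. Some adapters are an entirely new diffusion model, while others change only a small subset of parameters or consist of nothing more than an embedding. As a result, each kind of adapter is loaded in its own way.

This post shows how to load four common adapters: DreamBooth, Textual Inversion, LoRA and IP-Adapter.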

DreamBooth

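DreamBooth fine-tunes the weights of the entire diffusion model. It needs only a few images of a subject for training, and can then generate that subject in other styles or settings. At inference time, the prompt must contain a special trigger word that is associated with the subject in the training data. Of all the techniques covered in this post, DreamBooth produces the largest output (several GB): because it fine-tunes every parameter of the diffusion model, the result is effectively a brand-new set of diffusion model weights.

Strictly speaking, loading a DreamBooth checkpoint is not really loading an adapter, since it amounts to loading a completely new diffusion model.

Here we load a DreamBooth checkpoint, using herge_style as the example. It was trained on only 10 images drawn by Hergé, which is enough to learn his drawing style and generate images in that style. At inference time, the trigger word herge_style has to appear in the prompt to make the model produce this style.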

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("sd-dreambooth-library/herge-style", torch_dtype=torch.float16).to("cuda")
prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
image = pipeline(prompt).images[0]
image

As you can see, the output has a very distinctive look that matches Hergé's drawing style.

Textual Inversion

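Textual inversion is similar to DreamBooth: it also trains on a small number of images to learn a specific concept (a particular style or a particular object). It works by learning an embedding for the training images (the style or object) and binding that embedding to a new special trigger word by extending the tokenizer's vocabulary. At inference time, the trigger word is again required in the prompt. Because only a single embedding is trained and bound, the training output is tiny (typically a few tens of KB), and the base model's parameters stay unchanged.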
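To make the idea of "extending the vocabulary and binding a new embedding" concrete, here is a toy sketch in plain Python. The vocabulary, the embedding table, the helper names and the learned vector are all made up for illustration; a real tokenizer and text encoder are far larger, and this is not how diffusers implements it internally.

```python
# Toy illustration of textual inversion: the base vocabulary and embeddings stay
# frozen, and one new token with its own (learned) embedding is appended.
# All names and numbers here are invented for illustration.

vocab = {"a": 0, "cute": 1, "bear": 2}   # stand-in for the tokenizer vocabulary
embeddings = [                            # stand-in for the text encoder's embedding table
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def add_concept(vocab, embeddings, token, learned_vector):
    """Register a new trigger token and append its embedding row."""
    if token in vocab:
        raise ValueError(f"token {token!r} already exists")
    if len(learned_vector) != len(embeddings[0]):
        raise ValueError("embedding dimension mismatch")
    new_vocab = dict(vocab)
    new_vocab[token] = len(embeddings)    # new id = next free row
    new_embeddings = [row[:] for row in embeddings] + [list(learned_vector)]
    return new_vocab, new_embeddings


def embed(prompt_tokens, vocab, embeddings):
    """Look up the embedding row of every token in the prompt."""
    return [embeddings[vocab[tok]] for tok in prompt_tokens]


# Pretend this vector came out of textual inversion training.
learned = [0.42, -0.13, 0.07, 0.99]
new_vocab, new_embeddings = add_concept(vocab, embeddings, "<gta5-artwork>", learned)

prompt_vectors = embed(["a", "cute", "bear", "<gta5-artwork>"], new_vocab, new_embeddings)
print(new_vocab["<gta5-artwork>"])   # id assigned to the new trigger token
print(prompt_vectors[-1])            # the learned embedding bound to it
```

The existing rows are copied unchanged, which mirrors the point above: only the new embedding is stored, so the artifact is tiny and the base model is untouched.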

Since a textual inversion is just a new embedding, it cannot generate images on its own, so we first load the base model it was trained for:

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

Next, we load the textual inversion embedding with the load_textual_inversion method. The example here uses sd-concepts-library/gta5-artwork. Note that the prompt must include the corresponding trigger word <gta5-artwork> to invoke the GTA5 style that this textual inversion represents:

pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"
image = pipeline(prompt).images[0]
image

As you can see, the result does indeed have a strong GTA5 art style.

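Besides learning a specific object or style, textual inversion has another common use: as a negative prompt. First collect a set of low-quality images of the kind we do not want (blurry images, extra fingers, and so on) and train a textual inversion embedding on them so that it captures this undesirable look. Then load the embedding and put its trigger word in the negative prompt, which steers the model away from these low-quality results.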
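Why does a negative prompt push results away from a concept? A minimal sketch, assuming the usual classifier-free guidance formulation used by Stable Diffusion style pipelines: the prediction conditioned on the negative prompt takes the place of the unconditional prediction, and the final prediction is extrapolated away from it. The function name and the numbers below are illustrative only.

```python
# Classifier-free guidance with a negative prompt, on toy 2-element "noise predictions".
# Assumption: guided = negative + guidance_scale * (positive - negative).

def guided_noise(noise_negative, noise_positive, guidance_scale):
    """Extrapolate from the negative-prompt prediction toward the positive one."""
    return [n + guidance_scale * (p - n) for n, p in zip(noise_negative, noise_positive)]


noise_negative = [1.0, 0.0]   # prediction conditioned on the negative prompt
noise_positive = [2.0, 4.0]   # prediction conditioned on the prompt

no_guidance = guided_noise(noise_negative, noise_positive, guidance_scale=1.0)
strong_guidance = guided_noise(noise_negative, noise_positive, guidance_scale=7.5)

print(no_guidance)       # equals the positive prediction
print(strong_guidance)   # pushed further away from the negative prediction
```

With a scale above 1, the result moves past the positive prediction in the direction that leads away from the negative one, which is the effect a negative embedding relies on.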

The example here uses sayakpaul/EasyNegative-test, a pretrained negative textual inversion embedding. We again use load_textual_inversion, but this time we also pass weight_name to specify which file contains the embedding, and token to specify the trigger word:

pipeline.load_textual_inversion(
    "sayakpaul/EasyNegative-test", weight_name="EasyNegative.safetensors", token="EasyNegative"
)

Then add the trigger word EasyNegative to the negative_prompt parameter:

prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative"
negative_prompt = "EasyNegative"

image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
image

LoRA

Loading and unloading LoRA

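Low-Rank Adaptation (LoRA) is a very popular training technique because it trains quickly and produces small weight files (typically a few hundred MB). Like the other techniques in this post, LoRA can teach a model a new style from a handful of images. It works by inserting new weights into an already trained diffusion model and updating only those new weights during training, rather than the whole diffusion model. This is why LoRA trains faster and needs so little extra storage.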
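The storage argument can be made concrete with a toy layer. The sketch below assumes the common LoRA formulation, in which a frozen weight W gets an additive update B·A built from two small matrices; the layer sizes, the values and the helper names are illustrative and unrelated to any real checkpoint.

```python
# Toy LoRA layer in plain Python. W (d_out x d_in) is frozen; only the small
# matrices B (d_out x r) and A (r x d_in) would be trained and saved.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def param_counts(d_out, d_in, rank):
    """Return (parameters in the full weight, parameters in the LoRA factors)."""
    return d_out * d_in, rank * (d_out + d_in)


W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]   # frozen base weight
B = [[0.5], [0.0], [0.0], [0.0]]                                      # trainable, 4 x 1
A = [[0.0, 2.0, 0.0, 0.0]]                                            # trainable, 1 x 4


def lora_forward(x):
    """Base output plus the low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


output = lora_forward([1.0, 1.0, 1.0, 1.0])
full, lora = param_counts(d_out=1024, d_in=1024, rank=4)

print(output)            # base output with the first entry shifted by the update
print(full, lora)        # parameters to store: full fine-tune vs. LoRA factors
print(full // lora)      # how many times smaller the LoRA factors are
```

For a 1024 x 1024 layer at rank 4, the two factors hold 128 times fewer parameters than the full weight, which is where the small file size comes from.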

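LoRA is a general-purpose training technique that can be combined with other training methods; DreamBooth + LoRA, for example, is a common recipe.

Next, let's see how to load a LoRA and run inference with it.

First, load a pipeline: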

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

We fix a prompt first and look at what the base model generates on its own:

prompt = "bears, pizza bites"
image = pipeline(prompt).images[0]
image

Then we call the load_lora_weights method. Here we load an example LoRA, ostris/super-cereal-sdxl-lora. Note that the weight file name has to be specified:

pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors")
image = pipeline(prompt).images[0]
image

As you can see, the LoRA has a clear and pleasing effect on the style. Feel free to try other prompts to see how it behaves.

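The load_lora_weights method used above tries to load the LoRA weights into both the UNet and the text encoder. It is the convenient default because it handles both cases: LoRA weights that have no separate identifiers for the UNet and the text encoder, and LoRA weights that do.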
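As a rough picture of what "separate identifiers" can mean, the toy function below routes weights by key prefix. The key names, the prefixes and the fallback rule are assumptions made for illustration; this is not the diffusers loading code.

```python
# Toy routing of LoRA weights by key prefix. The key names are invented for
# illustration; this is not the diffusers loading code.

def split_lora_state_dict(state_dict):
    """Return (unet_weights, text_encoder_weights) from a flat dict of LoRA weights."""
    prefixes = ("unet.", "text_encoder.")
    has_identifiers = any(key.startswith(prefixes) for key in state_dict)

    if not has_identifiers:
        # No component identifiers: assume every weight belongs to the UNet.
        return dict(state_dict), {}

    unet, text_encoder = {}, {}
    for key, value in state_dict.items():
        if key.startswith("unet."):
            unet[key[len("unet."):]] = value
        elif key.startswith("text_encoder."):
            text_encoder[key[len("text_encoder."):]] = value
        else:
            raise ValueError(f"cannot tell which component {key!r} belongs to")
    return unet, text_encoder


with_ids = {"unet.attn1.lora_A": 1, "text_encoder.layer0.lora_A": 2}
without_ids = {"attn1.lora_A": 1, "attn1.lora_B": 3}

print(split_lora_state_dict(with_ids))      # weights split across both components
print(split_lora_state_dict(without_ids))   # everything treated as UNet weights
```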

However, if you only need to load the LoRA weights into the UNet, you can use the load_attn_procs method. Here we load jbilcke-hf/sdxl-cinematic-1 as an example:

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.unet.load_attn_procs("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors")

# Use `cnmt` in the prompt to trigger the LoRA
prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration"
image = pipeline(prompt).images[0]
image

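Whether the LoRA was loaded with load_lora_weights or load_attn_procs, you can pass cross_attention_kwargs={"scale": 0.5} when calling the pipeline to adjust how strongly the loaded LoRA weights are applied. A value of 0 is equivalent to using only the base model, and a value of 1 is equivalent to using the fine-tuned LoRA weights in full.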
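The two endpoints of the scale can be checked on a tiny example. This is a sketch under the assumption that the scale simply multiplies the low-rank update before it is added to the base output; the matrices and the function name are illustrative.

```python
# Effect of the LoRA scale on a toy 2 x 2 layer.
# Assumption: output = W @ x + scale * B @ (A @ x).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_output(W, A, B, x, scale):
    """Base output plus the LoRA update weighted by `scale`."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # LoRA down-projection (rank 1)
B = [[2.0], [-2.0]]            # LoRA up-projection
x = [1.0, 3.0]

results = {scale: lora_output(W, A, B, x, scale) for scale in (0.0, 0.5, 1.0)}
for scale, out in results.items():
    print(scale, out)
```

At scale 0.0 the output is exactly the base layer's output, at 1.0 the full update is applied, and 0.5 lands halfway between the two.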

To unload the LoRA weights, use the unload_lora_weights method. It discards the LoRA weights and restores the model to the original weights it had before the LoRA was loaded.

pipeline.unload_lora_weights()

If you only want to disable or re-enable the LoRA temporarily, you can use pipeline.disable_lora() and pipeline.enable_lora() instead.

Loading and fusing multiple LoRAs

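Sometimes using several LoRAs together produces new and interesting results. fuse_lora() merges the LoRA weights into the original weights of the base model.

Fusing the base model and the LoRA can speed up inference, because the base model and the LoRA no longer have to be loaded separately. After fusing, you can save the pipeline with save_pretrained, which avoids reloading the base model and the LoRA every time the model is used.

First, load the base model: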

from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

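Then load the LoRA weights and fuse them with the original weights. The lora_scale parameter is similar to cross_attention_kwargs={"scale": 0.5} above and controls how strongly the LoRA weights are fused in. Be sure to set it properly at fusion time, because once the weights are fused the scale in cross_attention_kwargs can no longer be used to control it.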

pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl")
pipeline.fuse_lora(lora_scale=0.7)

To undo the fusion (for example to choose a different lora_scale, or to switch to another LoRA), use the unfuse_lora method:

pipeline.unfuse_lora()

# Fuse a new LoRA
pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora")
pipeline.fuse_lora(lora_scale=0.7)

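Note that once multiple LoRAs have been fused, they can no longer be unfused this way; the only option is to reload the original weights of the base model.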
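Conceptually, fusing bakes the scaled update into the base weight, and unfusing subtracts the same update again. The sketch below assumes that simple arithmetic (W_fused = W + lora_scale · B·A) and uses hypothetical helper names fuse and unfuse; it illustrates the idea and is not the library's implementation.

```python
# Toy fuse / unfuse of a rank-1 LoRA into a 2 x 2 weight.
# Assumption: W_fused = W + lora_scale * (B @ A); unfusing subtracts the same term.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def fuse(W, A, B, lora_scale):
    """Return a new weight with the scaled LoRA update merged in."""
    delta = matmul(B, A)
    return [[w + lora_scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def unfuse(W_fused, A, B, lora_scale):
    """Recover the base weight by subtracting the same scaled update."""
    delta = matmul(B, A)
    return [[w - lora_scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W_fused, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -1.0]]
B = [[1.0], [2.0]]
x = [2.0, 4.0]
lora_scale = 0.5

W_fused = fuse(W, A, B, lora_scale)

# Unfused path: two extra small matmuls on every forward pass.
separate = [b + lora_scale * d for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Fused path: a single matmul, with the scale already baked into the weight.
fused = matvec(W_fused, x)

print(separate, fused)
print(unfuse(W_fused, A, B, lora_scale))   # back to the base weight
```

Because the scale is folded into W_fused, it cannot be changed afterwards without unfusing first, which matches the advice above to choose lora_scale carefully at fusion time.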

IP-Adapter

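IP-Adapter is an effective and lightweight adapter that gives a diffusion model image prompting ability: in addition to the text prompt, an input image can also serve as a prompt. It works by decoupling the cross-attention layers for image features and text features. While IP-Adapter is trained, all other components are frozen and only the image-feature parameters embedded in the UNet are trained. The training output therefore contains only part of the UNet weights, and the file is small (typically around 100 MB).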
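A minimal sketch of decoupled cross-attention, assuming the commonly described formulation: the same query attends to text features and to image features through two separate attention branches, and the image branch is added to the text branch with a weight. The vectors, the ip_scale name and the single-query setup are illustrative simplifications.

```python
import math

# Toy decoupled cross-attention for a single query vector.
# Assumption: output = Attention(q, K_text, V_text) + ip_scale * Attention(q, K_image, V_image).


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    """Scaled dot-product attention of one query over lists of key/value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def decoupled_cross_attention(q, text_kv, image_kv, ip_scale):
    """Sum of the text branch and the weighted image branch."""
    text_out = attention(q, *text_kv)
    image_out = attention(q, *image_kv)
    return [t + ip_scale * i for t, i in zip(text_out, image_out)]


q = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])   # (keys, values) from the text prompt
image_kv = ([[1.0, 1.0]], [[10.0, 20.0]])                        # (keys, values) from the image prompt

text_only = decoupled_cross_attention(q, text_kv, image_kv, ip_scale=0.0)
blended = decoupled_cross_attention(q, text_kv, image_kv, ip_scale=0.5)

print(text_only)   # identical to plain text cross-attention
print(blended)     # text result shifted by half of the image branch
```

Since the text branch is left as it was, only the image-side projections would need to be trained in such a design, which is consistent with the small file size mentioned above.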

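IP-Adapter works with most pipelines in diffusers, including Stable Diffusion, Stable Diffusion XL (SDXL), ControlNet, T2I-Adapter and AnimateDiff. It can be used with any custom model fine-tuned from the same base model, and it also works with LCM-LoRA out of the box.

First, we create an SD pipeline: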

from diffusers import AutoPipelineForText2Image
import torch
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

Then load the h94/IP-Adapter weights with the load_ip_adapter method:

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

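IP-Adapter relies on an image encoder to produce image features. If the IP-Adapter repository contains an image_encoder subfolder, the image encoder weights are loaded into the pipeline automatically. Otherwise, we need to load a CLIPVisionModelWithProjection model and pass it to the SD pipeline: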

from diffusers import AutoPipelineForText2Image
# CLIPVisionModelWithProjection is provided by transformers, not diffusers
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", 
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
).to("cuda")

pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, torch_dtype=torch.float16).to("cuda")

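In our example, the adapter can be loaded directly, without this extra step.

With IP-Adapter loaded, we can condition generation on both text and an image. Here we load an image of a bear as the image condition and use the text condition to put sunglasses on it: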

pipeline.set_ip_adapter_scale(0.6)
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png")
generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt='best quality, high quality, wearing sunglasses', 
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=50,
    generator=generator,
).images
images[0]

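The set_ip_adapter_scale method adjusts the relative weight of the text prompt and the image prompt. To use only the image prompt, set the value to 1.0. Lowering the value increases the diversity of the output, at the cost of less consistency with the reference image. In general, when a text prompt and an image prompt are used together, a value of 0.5 gives good results.

IP-Adapter can also be used in img2img and inpainting pipelines. Code examples for each follow: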

# IP-Adapter in img2img
from diffusers import AutoPipelineForImage2Image
import torch
from diffusers.utils import load_image

pipeline = AutoPipelineForImage2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/vermeer.jpg")
ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/river.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt='best quality, high quality', 
    image=image,
    ip_adapter_image=ip_image,
    num_inference_steps=50,
    generator=generator,
    strength=0.6,
).images
images[0]

# -----------------------------------------------------------------------------------
# IP-Adapter in inpainting
from diffusers import AutoPipelineForInpaint
import torch
from diffusers.utils import load_image

pipeline = AutoPipelineForInpaint.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float).to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/inpaint_image.png")
mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/mask.png")
ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/girl.png")

image = image.resize((512, 768))
mask = mask.resize((512, 768))

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt='best quality, high quality', 
    image=image,
    mask_image=mask,
    ip_adapter_image=ip_image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=50,
    generator=generator,
    strength=0.5,
).images
images[0]

It works in an SDXL pipeline as well:

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/watercolor_painting.jpeg")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
image = pipeline(
    prompt="best quality, high quality", 
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=25,
    generator=generator,
).images[0]
image.save("sdxl_t2i.png")

With the recent LCM-LoRA, IP-Adapter also enables “instant fine-tune”. Note that the IP-Adapter weights must be loaded first, and the LCM-LoRA weights after them:

from diffusers import DiffusionPipeline, LCMScheduler
import torch
from diffusers.utils import load_image

model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "best quality, high quality"
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = pipe(
    prompt=prompt,
    ip_adapter_image=image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]

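In fact, any pipeline that meets these two conditions is compatible with IP-Adapter:

It uses a text prompt
It uses an SD / SDXL base model

Loading and using it is simple too: after creating the pipeline, load the adapter with the load_ip_adapter method, then pass the image as the ip_adapter_image argument at generation time.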
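The load-then-call pattern just described can be wrapped in a small helper. The sketch below uses only the calls already shown in this post (load_ip_adapter, set_ip_adapter_scale and the ip_adapter_image argument); the helper name generate_with_ip_adapter and the FakePipeline stand-in are hypothetical and exist only so the snippet runs without a GPU or model download.

```python
from types import SimpleNamespace


def generate_with_ip_adapter(pipeline, prompt, ip_image, scale=0.6, **call_kwargs):
    """Load the SD 1.5 IP-Adapter into `pipeline`, then generate one image."""
    pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
    pipeline.set_ip_adapter_scale(scale)
    result = pipeline(prompt=prompt, ip_adapter_image=ip_image, **call_kwargs)
    return result.images[0]


class FakePipeline:
    """Hypothetical stand-in that records calls; swap in a real diffusers pipeline."""

    def __init__(self):
        self.calls = []

    def load_ip_adapter(self, repo, subfolder, weight_name):
        self.calls.append(("load_ip_adapter", repo, subfolder, weight_name))

    def set_ip_adapter_scale(self, scale):
        self.calls.append(("set_ip_adapter_scale", scale))

    def __call__(self, **kwargs):
        self.calls.append(("__call__", sorted(kwargs)))
        return SimpleNamespace(images=["<generated image>"])


fake = FakePipeline()
image = generate_with_ip_adapter(
    fake,
    "best quality, high quality",
    ip_image="<reference image>",
    num_inference_steps=50,
)

for call in fake.calls:
    print(call)
```

With a real pipeline, pass the object returned by from_pretrained(...).to("cuda") and a PIL image from load_image in place of the stand-ins. For an SDXL base model, the subfolder and weight file would need to be the SDXL ones shown earlier.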

Below are examples of using IP-Adapter in two more pipelines, ControlNet and AnimateDiff:

# IP-Adapter in ControlNet
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image

controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt='best quality, high quality', 
    image=depth_map,
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=50,
    generator=generator,
).images
images[0]
# -----------------------------------------------------------------------------------
# IP-Adapter in AnimateDiff
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif, load_image


adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)

model_id = "Lykon/DreamShaper"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)

scheduler = DDIMScheduler(
    clip_sample=False,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="linear",
    timestep_spacing="trailing",
    steps_offset=1
)
pipe.scheduler = scheduler

pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up")
pipe.load_lora_weights("guoyww/animatediff-motion-lora-pan-left", adapter_name="pan-left")

seed = 42
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = [image] * 3
prompts = ["best quality, high quality"] * 3
negative_prompt = "bad quality, worst quality"
adapter_weights = [[0.75, 0.0, 0.0], [0.0, 0.0, 0.75], [0.0, 0.75, 0.75]]

output_frames = []
for prompt, image, adapter_weight in zip(prompts, images, adapter_weights):
    pipe.set_adapters(["zoom-out", "tilt-up", "pan-left"], adapter_weights=adapter_weight)
    output = pipe(
        prompt=prompt,
        num_frames=16,
        guidance_scale=7.5,
        num_inference_steps=30,
        ip_adapter_image=image,
        generator=torch.Generator("cpu").manual_seed(seed),
    )
    frames = output.frames[0]
    output_frames.extend(frames)

export_to_gif(output_frames, "test_out_animation.gif")