AI Painting: Text-to-Image & Image-to-Image (Based on diffusers)
1. Results Showcase
The main results of this test are shown below:
SDXL text-to-image
Cute LoRA
2. Basic Principles
The basic principles of the model are as follows.
Stable Diffusion first trains an autoencoder that learns to compress image data into a low-dimensional representation. With the trained encoder E, a full-size image can be encoded into low-dimensional latent data (compressed data); with the trained decoder D, the latent data can be decoded back into an image. The diffusion process itself runs in this low-dimensional latent space, which is why it is faster than a pure pixel-space diffusion model.
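To make the round trip concrete, here is a minimal sketch of encoding into latents and decoding back, assuming the standard SDXL base repo id and a random tensor standing in for a real image normalized to [-1, 1]:
import torch
from diffusers import AutoencoderKL
# load just the VAE component of SDXL (float32 for numerical stability)
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae")
# stand-in for a real 512x512 RGB image normalized to [-1, 1]
image = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor  # -> 1x4x64x64
    decoded = vae.decode(latents / vae.config.scaling_factor).sample  # back to 1x3x512x512
print(latents.shape, decoded.shape)
The 8x spatial downsampling (512x512x3 pixels into 64x64x4 latents) is exactly what makes diffusion in latent space so much cheaper than in pixel space.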
Basic model components: CLIP (text encoder) + VAE + U-Net (the diffusion model).
The U-Net predicts the noise at each step, iteratively refining the generation while injecting text semantics along the way. The scheduler algorithm then post-processes each U-Net noise prediction (dynamically adjusting the predicted noise and controlling how strongly it is applied), thereby orchestrating the pace of the whole generation process. In SD, the U-Net typically runs for roughly 50 to 100 iterations, during which the quality of the latent feature steadily improves (less pure noise, more image semantics, more text semantics). Once the U-Net and scheduler are done, SD feeds the final latent feature into the image decoder (VAE Decoder), which reconstructs it into a pixel-level image.
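The iterative loop itself boils down to a few lines. Below is a conceptual sketch, assuming the SD v1.4 U-Net for brevity, an Euler scheduler, and a random tensor in place of real CLIP text embeddings (a real pipeline would get these from the text encoder):
import torch
from diffusers import UNet2DConditionModel, EulerDiscreteScheduler
repo = "CompVis/stable-diffusion-v1-4"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet", torch_dtype=torch.float16).to("cuda")
scheduler = EulerDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")
scheduler.set_timesteps(50)
# stand-in for CLIP text embeddings: batch 1, 77 tokens, hidden size 768
text_embeddings = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda") * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample  # U-Net predicts the noise
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # scheduler removes it
# `latents` would now go through the VAE decoder to become a pixel image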
The main pipeline structure:
3. Environment Setup
diffusers is a diffusion library from Hugging Face. It provides simple, convenient pipelines for diffusion inference and training, and comes with a model and dataset community: much like torch.hub, your code can pull datasets and pretrained checkpoints directly from a specified repository. It is also easy to install, with a clean code structure and thorough comments, which makes building on top of it very efficient.
# pip
pip install --upgrade diffusers[torch]
# conda
conda install -c conda-forge diffusers
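A quick sanity check that the install worked and a GPU is visible:
import diffusers, torch
print(diffusers.__version__, torch.cuda.is_available())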
4. Code Implementation
Main test code:
Text-to-Image
from diffusers import DiffusionPipeline
import torch
# load both base & refiner
# local copy of stabilityai/stable-diffusion-xl-base-1.0
base = DiffusionPipeline.from_pretrained(
    "../data/data282269/", device_map=None, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
# Define how many steps to run and what fraction of them to run on each expert (80/20) here
n_steps = 40
high_noise_frac = 0.8
prompt = "A majestic lion jumping from a big stone at night"#"A majestic lion jumping from a big stone at night"
negative_prompt = "(worst quality:2),(low quality:2),(normal quality:2),(greyscale:2),watermark, lowres, bad anatomy, error, missing fingers, extra digit, fewer digits, cropped"
# run both experts
image = base(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,  # stop the base expert at 80% of the schedule
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,  # the refiner picks up the remaining 20%
    image=image,
).images[0]
image.save("output.png")
Image-to-Image
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove the following line if xFormers is not installed, or if you are on PyTorch 2.0 or higher
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "./work/in.jpg"
init_image = load_image(url)
prompt = "A gril in a jungle, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, num_inference_steps=30, strength=0.8).images[0]  # strength controls how strongly the source image is repainted
#make_image_grid([init_image, image], rows=1, cols=2)
image.save("img2img1.png")
Adding LoRAs
Multiple LoRAs can be loaded at once, and set_adapters assigns each one a different weight:
from diffusers import DiffusionPipeline
import torch
# load the base model
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")
base.load_lora_weights("/home/adam.wang/model/sd", weight_name="cute2.safetensors", adapter_name="cute")
base.load_lora_weights("/home/adam.wang/model/sd", weight_name="looking_at_viewer.safetensors", adapter_name="looking")
base.set_adapters(["cute", "looking"], adapter_weights=[1.0, 0.8])
# number of inference steps
n_steps = 35
prompt = "blind box,1girl,hanfu,brown eyes,blush,looking at viewer,outdoors,full body,solo,smile,chibi,<lora:blind_box:0.8>,best quality,Bokeh,ultra-fine painting,sharp focus,physically-based rendering,extreme detail description,(EOS R8,50mm,F1.2,8K,RAW photo"#"A majestic lion jumping from a big stone at night" east dragon,
negative_prompt = "nsfw,logo,text,badhandv4,EasyNegative,ng_deepnegative_v1_75t,rev2-badprompt,verybadimagenegative_v1.3,negative_hand-neg,mutated hands and fingers,poorly drawn face,extra limb,missing limb,disconnected limbs,malformed hands,ugly,"
# run inference with the LoRA adapters applied
lora_scale = 0.6
image = base(
    prompt=prompt,
    cross_attention_kwargs={"scale": lora_scale},
    negative_prompt=negative_prompt,
    num_inference_steps=n_steps,
).images[0]
image.save("t8.png")
Extension: loading different LoRAs produces different effects. For example, we could add a dragon-background LoRA to generate portraits in a different style, as sketched below.
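A minimal sketch of what that could look like (the dragon.safetensors file and the "dragon" adapter name are hypothetical placeholders; everything else reuses the objects from the snippet above):
# hypothetical dragon-background LoRA, loaded alongside the ones above
base.load_lora_weights("/home/adam.wang/model/sd", weight_name="dragon.safetensors", adapter_name="dragon")
base.set_adapters(["cute", "dragon"], adapter_weights=[0.8, 1.0])
image = base(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=n_steps).images[0]
image.save("t8_dragon.png")
base.disable_lora()  # switch all LoRAs off to recover the plain base model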
5. Resource Links
https://www.liblib.art/modelinfo/8345679083144158adb64b80c58e3afd