Deploying ExVideo, a New Post-Tuning Technique for Video Synthesis Models

2025/4/15 1:10:03

ExVideo is a new post-tuning technique for video synthesis models, developed jointly by researchers at East China Normal University and Alibaba.

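ExVideo proposes a new post-tuning strategy: instead of retraining the whole model at scale, it fine-tunes only the model's temporal components. This markedly extends the length of the clips the model can generate while greatly reducing the compute required. About 1.5k GPU hours of training are enough to raise the number of generated frames to 5 times that of the original model.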

ExVideo extends video length without sacrificing the model's generalization ability: the generated videos remain diverse in style and resolution.

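The technique also relies on several engineering optimizations, including parameter freezing, mixed-precision training, gradient checkpointing, and Flash Attention, and it uses the DeepSpeed library to shard optimizer states and gradients, so that training stays efficient on limited compute.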
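Parameter freezing, the first optimization listed above, can be sketched as follows. This is a minimal illustration and not ExVideo's actual training code: the helper `freeze_except`, the keyword `"temporal"`, and the parameter names are all hypothetical. The stand-in classes only mimic the `named_parameters()` / `requires_grad` interface of a `torch.nn.Module`, so the snippet runs without PyTorch installed.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter; only the requires_grad flag matters here."""
    requires_grad: bool = True


class FakeModel:
    """Stand-in for torch.nn.Module exposing named_parameters()."""

    def __init__(self, names):
        self._params = {name: FakeParam() for name in names}

    def named_parameters(self):
        return self._params.items()


def freeze_except(model, keywords=("temporal",)):
    """Freeze every parameter whose name contains none of the keywords.

    Returns the names of the parameters that stay trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical parameter names, for illustration only.
model = FakeModel([
    "blocks.0.spatial_attn.weight",
    "blocks.0.temporal_attn.weight",
    "blocks.1.temporal_conv.weight",
])
trainable_names = freeze_except(model)
print(trainable_names)
```

Because the helper only touches `named_parameters()` and `requires_grad`, the same function should work on a real PyTorch module, provided the temporal layers can be identified by name in that model.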

GitHub project: https://github.com/modelscope/DiffSynth-Studio.git

I. Environment Setup

1. Python environment

Python 3.10 or later is recommended.
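A quick way to confirm the interpreter meets this recommendation before installing anything is sketched below. The helper name `python_is_supported` is made up for this example; only the standard library is used.

```python
import sys

MIN_VERSION = (3, 10)  # recommendation from the text above


def python_is_supported(version_info=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "older than recommended"
    print(f"Python {current}: {status}")
```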

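2. Installing pip packages

Install the PyTorch build for CUDA 11.8 first:

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

Then install the remaining dependencies. The requirements.txt file is assumed here to be the one in the root of the DiffSynth-Studio repository cloned from the GitHub address above; the Tsinghua mirror is optional:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple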
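After installation, the pinned versions can be checked without importing the packages themselves. The sketch below uses the standard-library `importlib.metadata`; the helper names and the `EXPECTED` table are made up for this example, and build tags such as `+cu118` are ignored when comparing.

```python
from importlib import metadata

# Versions pinned by the pip command above (local build tags such as +cu118 are ignored).
EXPECTED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchaudio": "2.0.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """Compare versions while ignoring a local tag like '+cu118'."""
    return installed is not None and installed.split("+")[0] == expected


def check_environment(expected=EXPECTED):
    """Map each package name to (installed_version, is_expected_version)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (found, matches(found, wanted))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment().items():
        print(f"{name}: {found or 'not installed'} ({'ok' if ok else 'check'})")
```

A mismatch is not necessarily an error, since newer PyTorch releases may also work; the check only reports whether the environment matches the versions used in this walkthrough.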

3. Downloading the ExVideo-SVD-128f model:

git lfs install

git clone https://www.modelscope.cn/ECNU-CILab/ExVideo-SVD-128f-v1.git

4. Downloading the HunyuanDiT model:

git lfs install

git clone https://www.modelscope.cn/api/v1/models/modelscope/HunyuanDiT.git

5. Downloading the stable-video-diffusion model:

git lfs install

git clone https://www.modelscope.cn/api/v1/models/AI-ModelScope/stable-video-diffusion-img2vid-xt.git
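The test script in section II loads weights from fixed relative paths under `models/`, which differ from the directory names that `git clone` creates. A small check like the one below can list which of those files are still missing before a long run is launched. The helper `missing_files` is hypothetical; the paths are copied from the script, and the check looks only at file presence, not file integrity.

```python
from pathlib import Path

# Relative paths copied from the test script in section II.
EXPECTED_FILES = [
    "models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin",
    "models/HunyuanDiT/t2i/mt5/pytorch_model.bin",
    "models/HunyuanDiT/t2i/model/pytorch_model_ema.pt",
    "models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin",
    "models/stable_video_diffusion/svd_xt.safetensors",
    "models/stable_video_diffusion/model.fp16.safetensors",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected paths that do not exist as files under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected model files are present.")
```

Run it from the directory in which the test script will be started, since the paths are resolved relative to the working directory.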

II. Functional Test

1. Running the test:

(1) Calling the pipelines from Python code

import os
import torch
from diffsynth import save_video, ModelManager, SVDVideoPipeline, HunyuanDiTImagePipeline, download_models

def generate_image():
    # Explicitly enable Hugging Face tokenizers parallelism (this also silences its fork warning)
    os.environ["TOKENIZERS_PARALLELISM"] = "True"

    # Download necessary models
    download_models(["HunyuanDiT"])
    
    # Initialize ModelManager with required models
    model_manager = ModelManager(
        torch_dtype=torch.float16, 
        device="cuda", 
        file_path_list=[
            "models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin",
            "models/HunyuanDiT/t2i/mt5/pytorch_model.bin",
            "models/HunyuanDiT/t2i/model/pytorch_model_ema.pt",
            "models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin",
        ]
    )
    
    # Create image generation pipeline
    pipe = HunyuanDiTImagePipeline.from_model_manager(model_manager)
    
    # Generate and return the image
    torch.manual_seed(0)
    image = pipe(
        prompt="sunset time lapse at the beach with moving clouds and colors in the sky",
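        # Negative prompt kept in Chinese as in the original script. Rough meaning: wrong eyes, bad face,
        # disfigured, bad art, deformed, extra limbs, blurry colors, blurry, duplicated, morbid, mutilated.
        negative_prompt="错误的眼睛，糟糕的人脸，毁容，糟糕的艺术，变形，多余的肢体，模糊的颜色，模糊，重复，病态，残缺，",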
        num_inference_steps=50, height=1024, width=1024,
    )

    # Move model to CPU to free up GPU memory
    model_manager.to("cpu")
    return image

def generate_video(image):
    # Download necessary models
    download_models(["stable-video-diffusion-img2vid-xt", "ExVideo-SVD-128f-v1"])
    
    # Initialize ModelManager with required models
    model_manager = ModelManager(
        torch_dtype=torch.float16,
        device="cuda",
        file_path_list=[
            "models/stable_video_diffusion/svd_xt.safetensors",
            "models/stable_video_diffusion/model.fp16.safetensors",
        ]
    )
    
    # Create video generation pipeline
    pipe = SVDVideoPipeline.from_model_manager(model_manager)
    
    # Generate and return the video
    torch.manual_seed(1)
    video = pipe(
        input_image=image.resize((512, 512)),
        num_frames=128, fps=30, height=512, width=512,
        motion_bucket_id=127,
        num_inference_steps=50,
        min_cfg_scale=2, max_cfg_scale=2, contrast_enhance_scale=1.2
    )

    # Move model to CPU to free up GPU memory
    model_manager.to("cpu")
    return video

def upscale_video(image, video):
    # Download necessary models
    download_models(["stable-video-diffusion-img2vid-xt", "ExVideo-SVD-128f-v1"])
    
    # Initialize ModelManager with required models
    model_manager = ModelManager(
        torch_dtype=torch.float16,
        device="cuda",
        file_path_list=[
            "models/stable_video_diffusion/svd_xt.safetensors",
            "models/stable_video_diffusion/model.fp16.safetensors",
        ]
    )
    
    # Create video upscaling pipeline
    pipe = SVDVideoPipeline.from_model_manager(model_manager)
    
    # Generate and return the upscaled video
    torch.manual_seed(2)
    video = pipe(
        input_image=image.resize((1024, 1024)),
        input_video=[frame.resize((1024, 1024)) for frame in video],
        denoising_strength=0.5,
        num_frames=128, fps=30, height=1024, width=1024,
        motion_bucket_id=127,
        num_inference_steps=25,
        min_cfg_scale=2, max_cfg_scale=2, contrast_enhance_scale=1.2
    )

    # Move model to CPU to free up GPU memory
    model_manager.to("cpu")
    return video

# Main workflow
if __name__ == '__main__':
    # Generate the initial image
    image = generate_image()
    image.save("image.png")

    # Generate a video based on the initial image
    video = generate_video(image)
    save_video(video, "video_512.mp4", fps=30)

    # Optionally upscale the video to higher resolution
    upscaled_video = upscale_video(image, video)
    save_video(upscaled_video, "video_1024.mp4", fps=30)
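For a sense of what the settings in the script produce, the arithmetic below converts the frame count and frame rate into a clip length and compares the frame count with the base model. `NUM_FRAMES` and `FPS` come from the script; `BASE_FRAMES = 25` is an assumption about the base stable-video-diffusion-img2vid-xt model, and the helper name is made up for this example.

```python
def clip_duration_seconds(num_frames, fps):
    """Playback length of a clip saved with the given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


NUM_FRAMES = 128   # num_frames passed to the pipeline in the script above
FPS = 30           # fps passed to the pipeline and to save_video
BASE_FRAMES = 25   # assumption: frames per clip of the base SVD-XT model

duration = clip_duration_seconds(NUM_FRAMES, FPS)
extension = NUM_FRAMES / BASE_FRAMES

print(f"{NUM_FRAMES} frames at {FPS} fps = {duration:.2f} s")
print(f"about {extension:.1f}x the assumed base length")
```

Under that assumption the ratio is roughly consistent with the 5x figure quoted in the introduction.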

To be continued...

For more details, follow 杰哥新技术.