Qwen2.5-VL Open-Source Vision-Language Model: Hands-On Trial, Download, Inference, Fine-Tuning, and Deployment

I. Introduction to Qwen2.5-VL

Qwen2.5-VL is the flagship vision-language model of the Qwen family and a major step up from Qwen2-VL.

To try it, visit Qwen Chat and select Qwen2.5-VL-72B-Instruct.

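1. Key Enhancements

    1) Visual understanding: Qwen2.5-VL not only recognizes common objects such as flowers, birds, fish, and insects, it is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
    2) Agentic capability: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, operating on both computers and phones.
    3) Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than one hour, and this release adds the ability to capture events by pinpointing the relevant video segments.
    4) Visual localization in different formats: Qwen2.5-VL can localize objects in an image precisely by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
    5) Structured output: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of the contents, which is useful in finance, commerce, and similar fields.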
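
Because localization results come back as JSON, a small parser is usually the first thing needed downstream. The following is a minimal sketch, not the official post-processing code. It assumes each detection is an object with a "bbox_2d" list and an optional "label" string, and that the reply may be wrapped in a Markdown code fence; both the key names and the helper name parse_grounding_output are assumptions made for illustration.

```python
import json
import re

# Markdown code-fence marker, built indirectly so this listing stays readable
FENCE = "`" * 3


def parse_grounding_output(raw):
    """Turn a model reply containing a JSON list of detections into plain dicts.

    Assumed item shape (hypothetical): {"bbox_2d": [x1, y1, x2, y2], "label": "..."}
    """
    # If the reply is wrapped in a fenced block, keep only the fenced payload
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, re.DOTALL)
    payload = match.group(1) if match else raw

    items = json.loads(payload)
    if isinstance(items, dict):  # tolerate a single object instead of a list
        items = [items]

    boxes = []
    for item in items:
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# A made-up reply used only to exercise the parser
reply = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "bird"}]\n' + FENCE
print(parse_grounding_output(reply))
```

If the model is prompted for a different schema, only the key lookup inside the loop needs to change.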

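2. Architecture Updates over the Previous Generation

1) Dynamic resolution and frame-rate training for video understanding: by adopting dynamic FPS sampling, the Qwen team extended dynamic resolution to the temporal dimension, so the model can understand videos at various sampling rates. Accordingly, mRoPE was updated in the temporal dimension with IDs aligned to absolute time, which lets the model learn temporal order and speed and ultimately pinpoint specific moments.
2) Streamlined and efficient vision encoder: the Qwen team sped up training and inference by strategically introducing window attention into the ViT. The ViT architecture was further optimized with SwiGLU and RMSNorm so that it matches the structure of the Qwen2.5 LLM.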
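
To make the idea of temporal IDs aligned to absolute time more concrete, here is a toy illustration. It is not the model's implementation: the function name, the number of frames merged per temporal step (frames_per_grid) and the number of IDs per second (ids_per_second) are all assumed values chosen for the example. The point it shows is that the same moment in a video receives the same temporal ID regardless of the sampling rate.

```python
def temporal_position_ids(num_grids, sample_fps, frames_per_grid=2, ids_per_second=2):
    """Toy time-aligned temporal IDs for a video sampled at sample_fps.

    frames_per_grid and ids_per_second are illustrative assumptions.
    """
    # Wall-clock seconds covered by one temporal step of the vision encoder
    seconds_per_grid = frames_per_grid / sample_fps
    # The ID grows with elapsed time rather than with the frame index
    return [int(g * seconds_per_grid * ids_per_second) for g in range(num_grids)]


# The same 4-second clip sampled at two different rates
slow = temporal_position_ids(num_grids=4, sample_fps=2)  # one step per second
fast = temporal_position_ids(num_grids=8, sample_fps=4)  # one step per half second

# The step that starts at t = 3 s is index 3 in the slow clip and index 6 in the fast clip
print(slow[3], fast[6])
```

With index-based IDs the two samplings would disagree about where the 3-second mark sits; tying the ID to elapsed time removes that ambiguity.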

3. Model Availability

Alibaba has open-sourced the Base and Instruct models of Qwen2.5-VL on Hugging Face and ModelScope in three sizes: 3B, 7B, and 72B.

4. Related Resources

GitHub: https://github.com/QwenLM/Qwen2.5-VL

HuggingFace: https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5

ModelScope: https://modelscope.cn/collections/Qwen25-VL-58fbb5d31f1d47

Model demo: https://chat.qwenlm.ai/

If your hardware cannot run the model locally, you can use the official free platform instead. It runs on shared GPUs and has a usage quota, but the upside is that you can use the strongest Qwen2.5-VL model, the 72B, for free.

Alibaba Cloud Help Center: Model Studio (Bailian) large-model service platform (https://help.aliyun.com/zh/model-studio/user-guide/vision?spm=a2c4g.11186623.4.2.14014422Fom0Ne&scm=20140722.H_2845871._.ID_2845871-OR_rec-V_1#7a7077f8a9r6o)

vLLM official documentation: https://docs.vllm.ai/en/latest/models/engine_args.html

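II. Quick Start

1. Hugging Face

The Qwen2.5-VL pretrained checkpoints have been uploaded to the Hugging Face Model Hub and can be loaded through the transformers library.

pip install git+https://github.com/huggingface/transformers accelerate

The Qwen team also provides a toolkit that makes it easier to handle various kinds of visual input.

pip install qwen-vl-utils[decord]

# After installing flash-attn, you can use the recommended model-loading code that is commented out in the script below:
pip install flash-attn --no-build-isolation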

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
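
The commented-out min_pixels and max_pixels lines in the script treat one visual token as a 28x28 pixel area (a range of 256-1280 tokens is written as 256*28*28 to 1280*28*28 pixels). The sketch below uses that relationship to estimate how many visual tokens an image of a given size would cost. It is a simplified estimate written for this article, not the resize routine used by the processor; the function name and the rounding details are assumptions.

```python
import math

SIDE = 28  # pixels per visual-token side, taken from the 28*28 factor in the script


def estimate_visual_tokens(height, width, min_pixels=4 * SIDE * SIDE,
                           max_pixels=16384 * SIDE * SIDE):
    """Rough visual-token count after fitting an image into the pixel budget."""
    # Snap both sides to a multiple of 28
    h = max(SIDE, round(height / SIDE) * SIDE)
    w = max(SIDE, round(width / SIDE) * SIDE)

    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down
        scale = math.sqrt(height * width / max_pixels)
        h = max(SIDE, math.floor(height / scale / SIDE) * SIDE)
        w = max(SIDE, math.floor(width / scale / SIDE) * SIDE)
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / SIDE) * SIDE
        w = math.ceil(width * scale / SIDE) * SIDE

    return (h // SIDE) * (w // SIDE)


# A 1120x1120 image under the default budget and under a 1280-token cap
print(estimate_visual_tokens(1120, 1120))
print(estimate_visual_tokens(1120, 1120, max_pixels=1280 * SIDE * SIDE))
```

Lowering max_pixels is therefore a direct way to trade image detail for shorter prompts and lower memory use.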

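Running the script emits a warning related to the use_fast option, which can be set explicitly when loading the processor:

# default processor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", use_fast=True)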

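In the Hugging Face transformers library, the fast and the slow (Python) tokenizer implementations differ mainly in the following ways:

1. Implementation

Fast: implemented in Rust, and generally faster and more efficient.

Slow (Python): implemented in Python; slower, but easier to debug and extend.

2. Loading

Fast: loaded with use_fast=True. If the model provides a fast tokenizer, it is preferred.

Slow: loaded with use_fast=False.

3. Performance

Fast: faster and lighter on memory when processing large volumes of data, which suits production use.

Slow: somewhat behind in speed and memory efficiency, but better suited to development and debugging.

4. Features

Fast: usually supports more advanced features, such as parallel processing and fast decoding.

Slow: offers more basic features, but is more flexible and open to custom extension.

5. Default behavior

Starting with transformers v4.48, the default becomes use_fast=True, even if the model was saved with the slow version. To keep the slow version, set use_fast=False explicitly.

Summary

If you need performance and efficiency, use the fast version (use_fast=True).

If you need more flexibility or easier debugging, choose the slow version (use_fast=False).

Pick whichever fits your actual needs.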

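2. ModelScope

Hugging Face requires a VPN from mainland China; readers who cannot reach it can get Qwen2.5-VL through ModelScope.

Qwen2.5-VL 3B model page: ModelScope. The usage code can be viewed at the top right of the page: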

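III. Hardware Recommendations for Local Deployment

Recommended hardware for inference:
INT4: RTX3090*1, 24 GB VRAM, 32 GB RAM, 200 GB system disk
INT4: RTX4090*1 or RTX3090*2, 24 GB VRAM, 32 GB RAM, 200 GB system disk
Fine-tuning has higher hardware requirements and is generally not recommended on personal machines.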
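
As a rough cross-check of these figures, the memory taken by the weights alone can be estimated from the parameter count and the bit width. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it uses the nominal sizes (3B, 7B, 72B) rather than exact parameter counts, and it ignores the KV cache, activations, and the vision encoder's working memory, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal model sizes named in this article; exact counts differ slightly
for name, params in [("3B", 3e9), ("7B", 7e9), ("72B", 72e9)]:
    int4 = weight_memory_gb(params, 4)
    bf16 = weight_memory_gb(params, 16)
    print(f"{name}: INT4 ~{int4:.1f} GB, BF16 ~{bf16:.1f} GB")
```

By this estimate, the 72B weights at INT4 already exceed a single 24 GB card, while the 3B and 7B models leave headroom for the cache and activations.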

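IV. Web UI Example

Install Git and Python first; I used Python 3.10.6.

1. Clone the Qwen2.5-VL GitHub repository and change into the project directory:

git clone https://github.com/QwenLM/Qwen2.5-VL

cd Qwen2.5-VL

2. Install the dependencies required by the web application:

pip install -r requirements_web_demo.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

3. To ensure GPU compatibility, install the latest CUDA-enabled builds of PyTorch, TorchVision, and TorchAudio. Even if PyTorch is already installed, you may hit problems when running the web application, so it is best to update:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

4. Update Gradio and Gradio Client to avoid the connection and UI errors that older versions can cause:

pip install -U gradio gradio_client

5. Install the qwen-vl-utils[decord] toolkit:

pip install qwen-vl-utils[decord]==0.0.8

qwen-vl-utils[decord] is a toolkit from the Qwen team that makes handling different kinds of visual input as convenient as calling an API. It supports base64, URLs, and interleaved images and videos.

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to use it when loading videos.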

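6. Download and launch a model. There are three options:

The smaller 3B model is recommended for laptops with limited GPU memory (for example, 8 GB VRAM).

python web_demo_mm.py --checkpoint-path "Qwen/Qwen2.5-VL-3B-Instruct"

With more than 8 GB of VRAM you can choose the 7B model, which is more capable and gives better results.

python web_demo_mm.py --checkpoint-path "Qwen/Qwen2.5-VL-7B-Instruct"

If you have a professional-grade GPU, you can go straight to the largest 72B model for the best results.

python web_demo_mm.py --checkpoint-path "Qwen/Qwen2.5-VL-72B-Instruct"

The command first downloads the model and then loads the processor and the model.

During the Two Sessions my VPN kept dropping whenever I accessed Hugging Face, so I tried downloading the 3B model from ModelScope instead (https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct/files):

Download the model:

pip install modelscope

modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct

Run web_demo:

python web_demo_mm.py --checkpoint-path "/home/coco/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct/"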
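
When switching between a hub ID and a locally downloaded copy, it can help to resolve the checkpoint path in one place. The sketch below is a hypothetical helper, not part of the Qwen or ModelScope tooling. It assumes the cache layout seen in the command above (~/.cache/modelscope/hub/models/<org>/<name>) and treats the presence of config.json as a sign that the download exists; both are assumptions that may not hold for other ModelScope versions.

```python
from pathlib import Path


def resolve_checkpoint(model_id, cache_root=None):
    """Return a local ModelScope cache directory if present, else the hub ID."""
    # Default cache location mirrors the path used in this article (assumption)
    root = Path(cache_root) if cache_root else (
        Path.home() / ".cache" / "modelscope" / "hub" / "models"
    )
    local_dir = root.joinpath(*model_id.split("/"))

    # config.json is used here as a simple marker of a downloaded model
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    return model_id


# Prints the local directory if the model was downloaded, otherwise the hub ID
print(resolve_checkpoint("Qwen/Qwen2.5-VL-3B-Instruct"))
```

The returned string can be passed to --checkpoint-path either way.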

7. Once the demo is running, open the local URL http://127.0.0.1:7860 in a browser to use it.

V. Local Deployment with vLLM

1. vLLM

vLLM is recommended for quickly deploying Qwen2.5-VL and running inference. The vLLM version must be greater than 0.7.2.

For more information, see:
vLLM official documentation

1) Install the required packages by running the following commands:

pip install git+https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775
pip install accelerate
pip install qwen-vl-utils
pip install 'vllm>0.7.2'

pip install flash-attn --no-build-isolation

The first command is equivalent to:

git clone https://github.com/huggingface/transformers

cd transformers

git checkout f3f6c86582611976e72be054675e2bf0abb5f775

pip install .
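
Since the setup depends on a vLLM version greater than 0.7.2, it can be worth checking the installed version before going further. The following is a minimal sketch using only the standard library. The comparison is deliberately simplified (it keeps the leading numeric parts and ignores pre-release and post-release tags), and the helper names are made up for this example; the packaging library's Version class would be the more robust choice.

```python
from importlib import metadata


def version_tuple(version):
    """Keep the leading numeric components, e.g. '0.7.3+cu124' -> (0, 7, 3)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:  # stop at the first non-numeric component, e.g. 'dev1'
            break
        parts.append(int(digits))
    return tuple(parts)


def is_newer(installed, minimum="0.7.2"):
    """True if installed is strictly greater than minimum (simplified compare)."""
    return version_tuple(installed) > version_tuple(minimum)


def check_vllm():
    """Describe whether the installed vLLM satisfies the > 0.7.2 requirement."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return "vllm is not installed"
    status = "OK" if is_newer(installed) else "too old, need > 0.7.2"
    return f"vllm {installed}: {status}"


print(check_vllm())
```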

2) Local inference

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "What is the text in the illustrate?"},
        ],
    },
]


# For video input, you can pass following values instead:
# "type": "video",
# "video": "<video URL>",
video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
            {"type": "text", "text": "请用表格总结一下视频中的商品特点"},
            {
                "type": "video", 
                "video": "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
                "total_pixels": 20480 * 28 * 28, "min_pixels": 16 * 28 * 28
            }
        ]
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,

    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

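On a single local RTX 4090, running the 3B model failed with: CUDA out of memory.

A web search turned up the following fixes:

Set the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
Reduce the input size: make sure image and video resolution is not too high.
Use quantization to reduce the model's memory footprint.
Release unused GPU memory: call torch.cuda.empty_cache() after the program finishes.
Adjust model parameters: lower max_model_len and limit_mm_per_prompt.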
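
Lowering max_model_len interacts with image inputs. The Quick Start comments equate a visual-token range of 256-1280 with min_pixels = 256*28*28 and max_pixels = 1280*28*28, so a single image capped at 1280*28*28 pixels can need up to about 1280 tokens by itself. The sketch below is a simple budget check under the assumption that the context window has to hold the text tokens, the visual tokens, and the tokens to be generated; the function name and the example numbers are illustrative.

```python
def fits_context(text_tokens, visual_tokens, max_new_tokens, max_model_len):
    """True if prompt tokens plus requested output fit in the context window."""
    return text_tokens + visual_tokens + max_new_tokens <= max_model_len


# Illustrative numbers: a short text prompt and 128 generated tokens
TEXT_TOKENS = 30
NEW_TOKENS = 128

# One image at the 1280-token cap does not fit a 512-token window
print(fits_context(TEXT_TOKENS, 1280, NEW_TOKENS, max_model_len=512))

# A text-only request fits
print(fits_context(TEXT_TOKENS, 0, NEW_TOKENS, max_model_len=512))

# The same image fits once the window is larger
print(fits_context(TEXT_TOKENS, 1280, NEW_TOKENS, max_model_len=4096))
```

In practice this means a very small max_model_len is workable for text-only prompts, while image or video prompts need either a larger window or a lower max_pixels.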

After applying these adjustments, the code looks like this:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import torch

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

MODEL_PATH = "Qwen/Qwen2.5-VL-3B-Instruct/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# Controls generation behavior: the lower the temperature, the more deterministic the text, favoring the highest-probability tokens
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

llm = LLM(model=MODEL_PATH,
          limit_mm_per_prompt={"image": 5, "video": 5},  # allow fewer multimodal inputs per prompt
          max_model_len=512,  # shorten the maximum sequence length
          gpu_memory_utilization=0.8
          )
prompt = "hello?"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Count the tokens in the text input
text_tokens = tokenizer(text, return_tensors="pt")
text_length = text_tokens.input_ids.shape[-1]
print(f"Text token length: {text_length}")

outputs = llm.generate([text], sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Release the GPU cache
del llm
torch.cuda.empty_cache()

The image recognized in the code is the Qwen logo: