MLM / Qwen: A Detailed Guide to Qwen2-VL (Introduction, Installation and Usage, Example Applications)

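Introduction to Qwen2-VL

1. Key enhancements:

2. Model architecture updates:

3. Performance

Image benchmarks

Video benchmarks

Agent benchmarks

Multilingual benchmarks

4. News

5. Limitations

Installing and using Qwen2-VL

1. Installation

2. Usage

(1) Chatting with Transformers

(2) ModelScope

More usage tips

Image resolution for better performance

Adding IDs for multiple image inputs

Adding vision IDs

(4) Try the Qwen2-VL-72B API!

3. Quantization

(1) AWQ

Using AWQ-quantized models with Transformers

(2) GPTQ

Using GPTQ models with Transformers

4. Benchmarks

(1) Performance of quantized models

Speed benchmark

5. Deployment

6. Training

LLaMA-Factory

Installation

Data preparation

Training

7. Function calling

(1) A simple use case:

8. Demo

Web UI example

Installation

Running the demo with FlashAttention-2

Selecting a different model (Qwen2-VL series only)

Customization

9. Docker

Example applications of Qwen2-VL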

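Introduction to Qwen2-VL

On August 30, 2024, Alibaba Cloud released Qwen2-VL, the latest version of the vision-language models in the Qwen family. Qwen2-VL is a series of multimodal large language models developed by the Qwen team at Alibaba Cloud.

GitHub: https://github.com/QwenLM/Qwen2-VL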

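1. Key enhancements:

>> SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

>> Understanding videos longer than 20 minutes: with its online streaming capability, Qwen2-VL can understand videos over 20 minutes long and supports high-quality video-based question answering, dialogue, and content creation.

>> Agent that can operate mobile phones, robots, and other devices: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.

>> Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now also understands text in other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.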

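2. Model architecture updates:

>> Dynamic resolution handling: unlike earlier versions, Qwen2-VL can process images of arbitrary resolution, mapping them into a dynamic number of visual tokens for a more human-like visual processing experience.

>> Multimodal Rotary Position Embedding (M-ROPE): the positional embedding is decomposed into several parts that capture 1D textual, 2D visual, and 3D video positional information, which strengthens multimodal processing.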
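The decomposition behind M-ROPE can be illustrated with a toy position-ID generator. This is only a sketch of the idea as described in the Qwen2-VL technical report, not the model's code: the function name `mrope_position_ids` and the segment format are hypothetical. Text tokens receive the same ID on all three axes (temporal, height, width), image tokens keep the temporal ID fixed while height and width vary over the patch grid, and the next segment starts after the largest ID used so far.

```python
def mrope_position_ids(segments):
    """Return a list of (temporal, height, width) IDs, one per token.

    segments: list of ("text", n_tokens) or ("image", grid_h, grid_w).
    Hypothetical input format, chosen only for this illustration.
    """
    ids = []
    next_pos = 0  # first free position index
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n_tokens = seg[1]
            for i in range(n_tokens):
                p = next_pos + i
                ids.append((p, p, p))  # identical on all axes: plain 1D RoPE
            next_pos += n_tokens
        elif kind == "image":
            _, grid_h, grid_w = seg
            for row in range(grid_h):
                for col in range(grid_w):
                    # temporal ID stays constant for a still image
                    ids.append((next_pos, next_pos + row, next_pos + col))
            # continue after the largest ID used by this image
            next_pos += max(grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return ids


if __name__ == "__main__":
    for triple in mrope_position_ids([("text", 2), ("image", 2, 3), ("text", 1)]):
        print(triple)
```

A video would additionally increment the temporal ID per frame; that case is left out to keep the sketch short.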

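We have open-sourced Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license and released an API for Qwen2-VL-72B. The open-source release is integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. Hope you enjoy it!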

3. Performance

Image benchmarks

Benchmark	Previous SoTA (Open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B (Coming soon)	Qwen2-VL-7B	Qwen2-VL-2B
MMMU (val)	58.3	68.3	69.1	64.5	54.1	41.1
DocVQA (test)	94.1	95.2	92.8	96.5	94.5	90.1
InfoVQA (test)	82.0	-	-	84.5	76.5	65.5
ChartQA (test)	88.4	90.8	85.7	88.3	83.0	73.5
TextVQA (val)	84.4	-	-	85.5	84.3	79.7
OCRBench	852	788	736	855	845	794
MTVQA	17.3	25.7	27.8	32.6	26.3	20.0
RealWorldQA	72.2	60.1	75.4	77.8	70.1	62.9
MME (sum)	2414.7	1920.0	2328.7	2482.7	2326.8	1872.0
MMBench-EN (test)	86.5	79.7	83.4	86.5	83.0	74.9
MMBench-CN (test)	86.3	80.7	82.1	86.6	80.5	73.5
MMBench-V1.1 (test)	85.5	78.5	82.2	85.9	80.7	72.2
MMT-Bench (test)	63.4	-	65.5	71.7	63.7	54.5
MMStar	67.1	62.2	63.9	68.3	60.7	48.0
MMVet (GPT-4-Turbo)	65.7	66.0	69.1	74.0	62.0	49.5
HallBench (avg)	55.2	49.9	55.0	58.1	50.6	41.7
MathVista (testmini)	67.5	67.7	63.8	70.5	58.2	43.0
MathVision	16.97	-	30.4	25.9	16.3	12.4

Video benchmarks

Benchmark	Previous SoTA (Open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B (Coming soon)	Qwen2-VL-7B	Qwen2-VL-2B
MVBench	69.6	-	-	73.6	67.0	63.2
PerceptionTest (test)	66.9	-	-	68.0	62.3	53.9
EgoSchema (test)	62.0	63.2	72.2	77.9	66.7	54.9
Video-MME (wo/w subs)	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8	63.3/69.0	55.6/60.4

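Agent benchmarks

	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall[1]	TM	-	90.2	93.1
		EM	-	50.0	53.2
Game	Number Line	SR	89.4[2]	91.5	100.0
	BlackJack	SR	40.2[2]	34.5	42.6
	EZPoint	SR	50.0[2]	85.5	100.0
	Point24	SR	2.6[2]	3.0	4.5
Android	AITZ	TM	83.0[3]	70.0	89.6
		EM	47.7[3]	35.3	72.1
AI2THOR	ALFRED (valid-unseen)	SR	67.7[4]	-	67.8
		GC	75.3[4]	-	75.8
VLN	R2R (valid-unseen)	SR	79.0	43.7[5]	51.7
	REVERIE (valid-unseen)	SR	61.0	31.6[5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively.
[1] Self-curated function call benchmark by the Qwen team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation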

Multilingual benchmarks

These results were evaluated on the MTVQA benchmark.

Models	AR	DE	FR	IT	JA	KO	RU	TH	VI	AVG
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	32.6
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2

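4. News

2024.08.30: We have released the Qwen2-VL series. The 2B and 7B models are available now; the open-source 72B model is coming soon.

5. Limitations

Although Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:
>> No audio support: the current model cannot understand audio information in videos.
>> Data timeliness: our image dataset is updated until June 2023, so information after that date may not be covered.
>> Limits on individuals and intellectual property (IP): the model's ability to recognize specific individuals or IP is limited, and it may not cover all well-known people or brands.
>> Limited handling of complex instructions: when given intricate multi-step instructions, the model's understanding and execution need improvement.
>> Insufficient counting accuracy: object counting is not very accurate, particularly in complex scenes, and needs further improvement.
>> Weak spatial reasoning: especially in 3D space, the model's inference of positional relationships between objects is inadequate, so it struggles to judge relative positions precisely.
These limitations point to ongoing directions for model optimization, and we are committed to continually improving the model's performance and scope of application.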

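Installing and using Qwen2-VL

1. Installation

Below, we provide simple examples showing how to use Qwen2-VL with ModelScope and Transformers.
The Qwen2-VL code is in the latest Hugging Face Transformers, and we recommend building from source with the following command: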

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_vl'

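We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it with the following command: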

pip install qwen-vl-utils

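2. Usage

(1) Chatting with Transformers

Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils.

import torch  # only needed if you enable the flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info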

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
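The trimming step in the snippet above deserves a closer look: `generate` returns the prompt tokens followed by the new tokens, so the prompt prefix is sliced off before decoding. A minimal sketch with plain Python lists standing in for tensors (the helper name `trim_prompt_tokens` is hypothetical):

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    input_ids: one token list per prompt (padded to equal length in a real batch).
    generated_ids: prompt tokens followed by newly generated tokens, per sample.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    prompts = [[101, 7, 8], [101, 9, 0]]  # second prompt padded with 0
    outputs = [[101, 7, 8, 55, 56], [101, 9, 0, 77, 78]]
    print(trim_prompt_tokens(prompts, outputs))
```

Without this step, the decoded text would repeat the whole chat template and user prompt before the model's answer.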

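(2) ModelScope

We strongly recommend that users, especially those in mainland China, use ModelScope. snapshot_download can help you solve problems with downloading checkpoints.

Image resolution for better performance

The model supports a wide range of input resolutions. By default it uses the native resolution of the input, but a higher resolution can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the best configuration for their needs, for example a token count range of 256-1280, to balance speed and memory usage.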

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

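In addition, we provide two ways to control the image size fed to the model at a fine-grained level:
>> Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.
>> Define min_pixels and max_pixels: images are resized, keeping their aspect ratio, so that the pixel count falls within the range between min_pixels and max_pixels.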
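To make the two controls above concrete, here is a rough sketch of how an image size could be snapped to multiples of 28 and scaled into the [min_pixels, max_pixels] range, together with the resulting visual token count. It assumes one visual token per 28x28 pixel block, as implied by the `256 * 28 * 28` convention above. The function names are hypothetical and the rounding details may differ from what qwen_vl_utils actually does.

```python
import math

FACTOR = 28  # side length in pixels of one visual token (assumption from 256*28*28)


def fit_resolution(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Return (height, width) snapped to multiples of 28 within the pixel budget."""
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > max_pixels:
        # shrink both sides by the same ratio, rounding down to stay under budget
        scale = math.sqrt(height * width / max_pixels)
        h = max(FACTOR, math.floor(height / scale / FACTOR) * FACTOR)
        w = max(FACTOR, math.floor(width / scale / FACTOR) * FACTOR)
    elif h * w < min_pixels:
        # enlarge both sides by the same ratio, rounding up to reach the minimum
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w


def visual_tokens(height, width):
    """Number of 28x28 blocks in an already-snapped image."""
    return (height // FACTOR) * (width // FACTOR)


if __name__ == "__main__":
    for size in [(1000, 1000), (448, 672), (200, 300)]:
        h, w = fit_resolution(*size)
        print(size, "->", (h, w), "tokens:", visual_tokens(h, w))
```

With the default budget, a 1000x1000 image is shrunk slightly so that it stays at or below 1280 tokens, while a small 200x300 image is enlarged to reach at least 256 tokens.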

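Adding IDs for multiple image inputs

By default, images and video content are included directly in the conversation. When handling multiple images, adding labels to the images and videos makes them easier to reference. Users can control this behavior with the following settings:

Adding vision IDs

Flash-Attention 2 to speed up generation

First, make sure you install the latest version of Flash Attention 2: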

pip install -U flash-attn --no-build-isolation

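In addition, your hardware must be compatible with Flash-Attention 2. See the official documentation in the flash attention repository for details. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.

To load and run the model with Flash Attention-2, simply add attn_implementation="flash_attention_2" when loading the model, as follows:

import torch
from transformers import Qwen2VLForConditionalGeneration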

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2",
)

(4) Try the Qwen2-VL-72B API!

To explore Qwen2-VL-72B, a more capable multimodal model, we encourage you to try our API service. Let's get started!

pip install dashscope


import dashscope
dashscope.api_key = "your_api_key"

messages = [{
    'role': 'user',
    'content': [
        {
            'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
        },
        {
            'text': 'What are in the image?'
        },
    ]
}]
# The model name 'qwen-vl-max-0809' is the identity of 'Qwen2-VL-72B'.
response = dashscope.MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
print(response)

For more usage, please refer to the tutorial on Alibaba Cloud.
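The message layout used in the dashscope call above (a list of `{'image': ...}` and `{'text': ...}` items under one user turn) can be wrapped in a small helper. This is only a sketch: `build_dashscope_messages` is a hypothetical name, and passing several images in a single turn is an assumption that should be checked against the Alibaba Cloud tutorial.

```python
def build_dashscope_messages(image_urls, question):
    """Build a single-turn user message with images followed by a text question."""
    content = [{"image": url} for url in image_urls]
    content.append({"text": question})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_dashscope_messages(
        ["https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"],
        "What are in the image?",
    )
    print(messages)
```

The returned list can be passed as the `messages` argument of `dashscope.MultiModalConversation.call`, as in the snippet above.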

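3. Quantization

For quantized models, we offer two types of quantization: AWQ and GPTQ.

(1) AWQ

We recommend using AWQ with AutoAWQ. AWQ stands for Activation-aware Weight Quantization, a hardware-friendly approach to low-bit weight-only quantization of LLMs. AutoAWQ is an easy-to-use package for 4-bit quantized models.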

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-AWQ",
#     torch_dtype="auto",
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Using AWQ-quantized models with Transformers

If you want to quantize your own model to AWQ, we recommend using AutoAWQ. We suggest installing the forked version of the package from source:

git clone https://github.com/kq-chen/AutoAWQ.git
cd AutoAWQ
pip install numpy gekko pandas
pip install -e .

Suppose you have fine-tuned a model based on Qwen2-VL-7B. To build your own AWQ-quantized model, you need to use your training data for calibration. Here is a simple example to run:

from transformers import Qwen2VLProcessor
from awq.models.qwen2vl import Qwen2VLAWQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load your processor and model with AutoAWQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
#     model_path, model_type="qwen2_vl", use_cache=False, attn_implementation="flash_attention_2"
# )
model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
    model_path, model_type="qwen2_vl", use_cache=False
)

Next, prepare the data for calibration. All you need to do is put the samples into a list, where each sample is a typical chat message as shown below. You can specify text and images in the content field, for example:

dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]

Here we use an image caption dataset purely as an example. You should replace it with your own SFT dataset.

def prepare_dataset(n_sample: int = 8) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()

Then process the dataset into tensors:

from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(
    dataset, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(dataset)
inputs = processor(
    text=text,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

Then run the calibration process with a single line of code:

model.quantize(calib_data=inputs, quant_config=quant_config)

Finally, save the quantized model:

model.model.config.use_cache = model.model.generation_config.use_cache = True
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
processor.save_pretrained(quant_path)

You now have your own AWQ-quantized model ready for deployment. Enjoy!

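(2) GPTQ

Using GPTQ models with Transformers

Transformers now officially supports AutoGPTQ, which means you can use quantized models directly with Transformers. Below is a very simple code snippet showing how to run the quantized model Qwen2-VL-7B-Instruct-GPTQ-Int4: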

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
# (the "Qwen/" prefix is added here to match the AWQ example; drop it if you load from a local directory)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

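Quantizing your own model with AutoGPTQ

If you want to quantize your own model to GPTQ, we recommend using AutoGPTQ. We suggest installing the forked version of the package from source: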

git clone https://github.com/kq-chen/AutoGPTQ.git
cd AutoGPTQ
pip install numpy gekko pandas
pip install -vvv --no-build-isolation -e .

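Suppose you have fine-tuned a model based on Qwen2-VL-7B. To build your own GPTQ-quantized model, you need to use your training data for calibration. Below is a simple demo to run: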

from transformers import Qwen2VLProcessor
from auto_gptq import BaseQuantizeConfig
from auto_gptq.modeling import Qwen2VLGPTQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8,  # 4 or 8
    group_size=128,
    damp_percent=0.1,
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
)
# Load your processor and model with AutoGPTQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config, attn_implementation="flash_attention_2")
model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config)

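Then prepare your data for calibration. All you need to do is put the samples into a list, where each sample is a typical chat message as shown below. You can specify text and images in the content field, for example: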

dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]

Here we use a caption dataset for demonstration purposes only. You should replace it with your own SFT dataset.

def prepare_dataset(n_sample: int = 20) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()

Then process the dataset into tensors:

from qwen_vl_utils import process_vision_info


def batched(iterable, n: int):
    # batched('ABCDEFG', 3) → ABC DEF G
    assert n >= 1, "batch size must be at least one"
    from itertools import islice

    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


batch_size = 1
calib_data = []
for batch in batched(dataset, batch_size):
    text = processor.apply_chat_template(
        batch, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(batch)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    calib_data.append(inputs)

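Then run the calibration process with a single line of code:

# pass the tensors built in the loop above rather than the raw chat messages
model.quantize(calib_data, cache_examples_on_gpu=False)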

Finally, save the quantized model:

model.save_quantized(quant_path, use_safetensors=True)
processor.save_pretrained(quant_path)

You now have your own GPTQ-quantized model ready for deployment. Enjoy!

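4. Benchmarks

(1) Performance of quantized models

This section reports the generation performance of the quantized models (both GPTQ and AWQ) in the Qwen2-VL series. Specifically, we report the following metrics:

MMMU_VAL (accuracy)

DocVQA_VAL (accuracy)

MMBench_DEV_EN (accuracy)

MathVista_MINI (accuracy)

We use VLMEvalKit to evaluate all models.

Speed benchmark

This section reports the speed of the bf16 models and the quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ) in the Qwen2-VL series. Specifically, we report inference speed (tokens/s) and memory footprint (GB) at different context lengths.

The environment used for evaluation with Hugging Face Transformers is:

NVIDIA A100 80GB

CUDA 11.8

PyTorch 2.2.1+cu118

Flash Attention 2.6.1

Transformers 4.38.2

AutoGPTQ 0.6.0+cu118

AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

Notes:

We use a batch size of 1 and as few GPUs as possible for the evaluation.

We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.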
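The input lengths listed above look arbitrary until the 2048 generated tokens are added. The quick check below is an arithmetic observation, not something stated in the source: apart from the 1-token case, every input length plus 2048 lands exactly on a power of two, from 8K up to 128K total context.

```python
# Input lengths and generation length quoted in the note above.
input_lengths = [1, 6144, 14336, 30720, 63488, 129024]
generated_tokens = 2048


def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set."""
    return n > 0 and n & (n - 1) == 0


total_lengths = [n + generated_tokens for n in input_lengths]

for n, total in zip(input_lengths, total_lengths):
    if is_power_of_two(total):
        label = f"2**{total.bit_length() - 1}"
    else:
        label = "not a power of two"
    print(f"input={n:>6}  total={total:>6}  ({label})")
```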

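5. Deployment

We recommend vLLM for fast deployment and inference of Qwen2-VL. You can use this fork (we are working on merging this PR into the main vLLM repository).

Run the command below to start an OpenAI-compatible API service:

Then you can use the chat API as follows (via curl or the Python API):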
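As a rough illustration of what a request to such an OpenAI-compatible endpoint looks like, the sketch below only builds and prints the JSON payload in the standard OpenAI chat-completions vision format and sends nothing. The helper name `build_chat_request` is hypothetical. The model name and the image URL are taken from other snippets in this guide, and the assumption is that the server would accept this payload at the `/chat/completions` path under `http://localhost:8000/v1`.

```python
import json


def build_chat_request(image_url, question,
                       model="Qwen/Qwen2-VL-7B-Instruct", max_tokens=128):
    """Build an OpenAI-style chat-completions payload with one image and one question."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "max_tokens": max_tokens,
    }


payload = build_chat_request(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Describe this image.",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be used as the body of a curl POST request or passed field by field to an OpenAI-compatible Python client.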

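Note: vllm.entrypoints.openai.api_server does not currently support setting min_pixels and max_pixels in messages (we are working on supporting this). If you want to limit the resolution, you can set them in the model's preprocessor_config.json: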
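One way to apply that setting is to patch the file with a short script. The sketch below assumes the two limits live under top-level `min_pixels` and `max_pixels` keys in preprocessor_config.json. The helper name `set_pixel_limits` is hypothetical, and the demo writes to a temporary file rather than a real checkpoint directory.

```python
import json
import tempfile
from pathlib import Path


def set_pixel_limits(config_path, min_pixels, max_pixels):
    """Write min_pixels/max_pixels into a preprocessor_config.json and return the result."""
    if min_pixels > max_pixels:
        raise ValueError("min_pixels must not exceed max_pixels")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["min_pixels"] = min_pixels
    config["max_pixels"] = max_pixels
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo on a throwaway file; point config_path at your model directory in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "preprocessor_config.json"
    demo_path.write_text(json.dumps({"patch_size": 14}), encoding="utf-8")
    print(set_pixel_limits(demo_path, 256 * 28 * 28, 1280 * 28 * 28))
```

Restart the API server after editing the file so that the new limits are picked up when the processor is loaded.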

You can also run Qwen2-VL inference locally with vLLM:

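6. Training

LLaMA-Factory

Here we provide a script for supervised fine-tuning (SFT) of Qwen2-VL with LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory. The SFT script has the following features:
>> Supports multi-image input;
>> Supports single-GPU and multi-GPU training;
>> Supports full-parameter tuning and LoRA.

Details on using the script follow.

Installation

Before you start, make sure the following packages are installed:

Follow the instructions of LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory to build the environment.

Install these packages (optional):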

pip install deepspeed

pip install flash-attn --no-build-isolation

If you want to use FlashAttention-2 https://github.com/Dao-AILab/flash-attention, make sure your CUDA version is 11.6 or above.

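Data preparation

LLaMA-Factory provides several training datasets in its data folder that you can use directly. If you use a custom dataset, prepare it as follows.

Organize your data in a JSON file and put it in the data folder. LLaMA-Factory supports multimodal datasets in ShareGPT format. A ShareGPT-format dataset should follow this format:

Provide your dataset definition in data/dataset_info.json in the following format. For a ShareGPT-format dataset, the columns in dataset_info.json should be:

Training

LoRA SFT example: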

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml

Full SFT example:

llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml

Inference example:

llamafactory-cli webchat examples/inference/qwen2_vl.yaml
llamafactory-cli api examples/inference/qwen2_vl.yaml

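Run the following training command:

Enjoy the training process. To change your training setup, adjust the hyperparameters by modifying the arguments of the training command. One argument to note is cutoff_len, the maximum length of the training data. Control this parameter to avoid OOM errors.

7. Function calling

Qwen2-VL supports function calling (also known as tool calling or tool use). For details on how to use this capability, see the function calling example and the agent example in the Qwen-Agent project.

(1) A simple use case: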

# pip install qwen_agent
from typing import List, Union
from datetime import datetime
from qwen_agent.agents import FnCallAgent
from qwen_agent.gui import WebUI
from qwen_agent.tools.base import BaseToolWithFileAccess, register_tool

@register_tool("get_date")
class GetDate(BaseToolWithFileAccess):
    description = "call this tool to get the current date"
    parameters = [
        {
            "name": "lang",
            "type": "string",
            "description": "one of ['en', 'zh'], default is en",
            "required": False
        },
    ]

    def call(self, params: Union[str, dict], files: List[str] = None, **kwargs) -> str:
        super().call(params=params, files=files)
        params = self._verify_json_format_args(params)
        # "lang" is optional, so fall back to "en" instead of raising KeyError
        lang = "zh" if "zh" in params.get("lang", "en") else "en"
        now = datetime.now()
        result = now.strftime("%Y-%m-%d %H:%M:%S") + "\n"
        weekday = now.weekday()
        if lang == "zh":
            days_chinese = ["一", "二", "三", "四", "五", "六", "日"]
            result += "今天是星期" + days_chinese[weekday]
        else:
            days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
            result += "Today is " + days[weekday]
        return result


def init_agent_service():
    llm_cfg_vl = {
        # Using Qwen2-VL deployed at any openai-compatible service such as vLLM:
        "model_type": "qwenvl_oai",
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "model_server": "http://localhost:8000/v1",  # api_base
        "api_key": "EMPTY",
    }
    tools = [
        "get_date",
        "code_interpreter",
    ]  # code_interpreter is a built-in tool in Qwen-Agent
    bot = FnCallAgent(
        llm=llm_cfg_vl,
        name="Qwen2-VL",
        description="function calling",
        function_list=tools,
    )
    return bot

def app_gui():
    # Define the agent
    bot = init_agent_service()
    WebUI(bot).run()

# Launch gradio app
app_gui()

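8. Demo

Web UI example

This section explains how to build a web-based user interface (UI) demo. The UI demo lets users interact with a predefined model or application through a web browser. Follow the steps below to get started.

Installation

Before you begin, make sure the required dependencies are installed on your system. You can install them by running the following command: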

pip install -r requirements_web_demo.txt

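Running the demo with FlashAttention-2

Once the required packages are installed, you can launch the web demo with the following command. It starts a web server and gives you a link for accessing the UI in your web browser.

Recommended: for better performance and efficiency, especially in multi-image and video processing scenarios, we strongly recommend using FlashAttention-2. FlashAttention-2 provides significant improvements in memory usage and speed, which makes it well suited to large models and heavy data processing.

To enable FlashAttention-2, use the following command:

python web_demo_mm.py --flash-attn2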

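This loads the model with FlashAttention-2 enabled.

Default usage: if you prefer to run the demo without FlashAttention-2, or if you do not specify the --flash-attn2 option, the demo loads the model with the standard attention implementation:

python web_demo_mm.py

After running the command, you will see a link in the terminal similar to this: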

Running on local: http://127.0.0.1:7860/

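Copy this link and paste it into your browser to access the web UI, where you can interact with the model by entering text, uploading images, or using any other provided features.

Selecting a different model (Qwen2-VL series only)

The demo is configured by default to use the Qwen/Qwen2-VL-7B-Instruct model, which is part of the Qwen2-VL series and well suited to a variety of vision-language tasks. If you want to use another model from the Qwen2-VL series, simply update the DEFAULT_CKPT_PATH variable in the script:

Locate the DEFAULT_CKPT_PATH variable: in web_demo_mm.py, find the DEFAULT_CKPT_PATH variable that defines the model checkpoint path. It should look like this: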

DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-7B-Instruct'

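Replace it with a different Qwen2-VL model path: change DEFAULT_CKPT_PATH to point to another checkpoint path in the Qwen2-VL series. For example:

DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-2B-Instruct'  # example: a different model in the series

Save and re-run: after modifying the path, save the script and re-run the demo following the instructions in the "Running the demo with FlashAttention-2" section above.

Note: DEFAULT_CKPT_PATH only supports models from the Qwen2-VL series. If you use a model outside the Qwen2-VL series, additional changes to the codebase may be required.

Customization

You can further customize the web demo by modifying the web_demo_mm.py script, including the UI layout, interactions, and extra functionality such as handling special inputs. This flexibility lets you tailor the web interface to specific tasks or workflows.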

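9. Docker

To simplify deployment, we provide Docker images with pre-built environments: qwenllm/qwenvl. You only need to install the driver and download the model files to launch the demo.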

docker run --gpus all --ipc=host --network=host --rm --name qwen2 -it qwenllm/qwenvl:2-cu121 bash

Example applications of Qwen2-VL

To be continuously updated...