Fine-tuning Qwen-VL, and a few small problems encountered along the way

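Qwen-VL is a large vision language model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. Compared with the llava-llama3 model covered in the previous post, it is more mature and more capable.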

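Notable features:

Interleaved multi-image dialogue: supports multi-image input and comparison, question answering about a specified image, multi-image creative writing, and so on;
First general-purpose model to support open-domain grounding in Chinese: bounding boxes can be annotated from open-domain Chinese language expressions;
Fine-grained recognition and understanding: compared with the 224 resolution currently used by other open-source LVLMs, Qwen-VL is the first open-source LVLM at 448 resolution. The higher resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.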
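To make the resolution comparison concrete, here is a small sketch that counts how many patches a ViT produces for a square image. The patch size of 14 is an assumption taken from the Qwen-VL paper's description of its ViT, not something stated in this post, and num_patches is a hypothetical helper name.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT sees for a square image.

    patch_size=14 is an assumption (from the Qwen-VL paper), not from this post.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


for size in (224, 448):
    print(size, num_patches(size))
# 224 256
# 448 1024
```

Doubling the side length quadruples the number of patches, which is why a compression step before the language model becomes useful at 448 resolution.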

1. Model principles

2. Model architecture

3. Using the model

4. Fine-tuning the model

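1. Model principles

Overall, Qwen-VL adopts a multimodal architecture similar to Flamingo: a learnable Query sequence attends over the image features to query and compress them, and the compressed features are then fed into the LLM together with the text to produce the output.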

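The figure above shows the basic structure of Flamingo. It introduces modules such as the visual resampler and the cross-attention adapter for image-text alignment. The Perceiver Resampler connects the vision encoder to a frozen language model: it takes a variable number of image or video features from the vision encoder as input and produces a fixed number of visual outputs.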

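The overall structure of Qwen-VL is shown below. It borrows Flamingo's visual resampler module to produce the visual output, which is then combined with the large language model.

The Qwen-VL network consists of three parts: a vision encoder (Vision Encoder), a vision-language adapter (VL Adapter), and a language model (LLM). The vision encoder has 1.9B parameters, the adapter 0.08B, and the language model 7.7B, for 9.6B in total.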

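As the figure shows, training proceeds in three stages:

Pretraining: only the vision encoder and the vision-language adapter are optimized; the language model is frozen. Large-scale image-text pair data is used, with an input image resolution of 224x224.
Multi-task pretraining: higher-resolution (448x448) multi-task vision-language data, such as VQA, text VQA, and referring expression comprehension, is introduced for joint multi-task pretraining.
Supervised fine-tuning: the vision encoder is frozen, and the language model and adapter are optimized. Dialogue interaction data is used for instruction tuning, yielding the final interactive Qwen-VL-Chat model.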
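The three stages can be summarized as a small table of which modules are trained and which are frozen. This is only a sketch: the module names are hypothetical labels, and two entries are assumptions that this post does not state (that all three modules are trained in stage 2, and that stage 3 stays at 448 resolution); both follow my reading of the Qwen-VL paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int
    trainable: frozenset


# Hypothetical labels for the three parts of the network.
MODULES = frozenset({"vision_encoder", "vl_adapter", "llm"})

STAGES = [
    Stage("pretraining", 224, frozenset({"vision_encoder", "vl_adapter"})),
    # Assumption (Qwen-VL paper): every module is trained in stage 2.
    Stage("multi_task_pretraining", 448, MODULES),
    # Assumption: the resolution stays at 448 during supervised fine-tuning.
    Stage("supervised_finetuning", 448, frozenset({"vl_adapter", "llm"})),
]


def frozen_modules(stage: Stage) -> frozenset:
    """Modules that receive no gradient updates in the given stage."""
    return MODULES - stage.trainable


for stage in STAGES:
    print(
        f"{stage.name} ({stage.resolution}px): "
        f"train={sorted(stage.trainable)} frozen={sorted(frozen_modules(stage))}"
    )
```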

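2. Model architecture

ModuleList (language model): contains 32 QWenBlock modules, each with one QWenAttention and one QWenMLP.
ViT (vision encoder): consists of a TransformerBlock part and a Resampler part. The TransformerBlock contains 48 VisualAttentionBlock modules, each with one VisualAttention taking 1664-dimensional input and one Sequential mlp; the Resampler contains one MultiheadAttention.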

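Below is a quick look at where these pieces appear in the code.

a. ViT vision encoder:

See the visual.py file.

It defines the VisionTransformer class for extracting image features. Overall, a ViT first extracts the features, and a Resampler then compresses and adapts them.

As we can see, the Query is defined at line 117; attention is computed between this Query and the ViT image features to compress them.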
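The compression step can be sketched with plain NumPy. This is not the code from visual.py: it is a single-head cross-attention with no positional encodings or layer norm, and the sizes (1024 patches, 256 queries, a feature width of 64) are illustrative assumptions. It only shows why the output length is fixed by the number of queries rather than by the number of image patches.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def resample(image_feats, queries, w_q, w_k, w_v):
    """Single-head cross-attention: the queries attend over the image features."""
    q = queries @ w_q                          # (num_queries, dim)
    k = image_feats @ w_k                      # (num_patches, dim)
    v = image_feats @ w_v                      # (num_patches, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v                         # (num_queries, dim)


rng = np.random.default_rng(0)
num_patches, num_queries, dim = 1024, 256, 64  # illustrative sizes (assumptions)
image_feats = rng.normal(size=(num_patches, dim))
queries = rng.normal(size=(num_queries, dim))  # learnable in a real model
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

compressed = resample(image_feats, queries, w_q, w_k, w_v)
print(compressed.shape)  # (256, 64)
```

Feeding in fewer or more patches changes only the width of the attention matrix; the result still has one row per query.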

b. Language model:

See modeling_qwen.py.

The ModuleList is built by stacking multiple QWenBlock modules.

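3. Using the model

For usage, refer to the official documentation. Either HuggingFace or ModelScope works; ModelScope is recommended because model downloads are faster, or you can download the model first and then load it locally.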

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir
# Choose exactly one of the loading options below; loading twice wastes time and memory.
# bf16 precision: recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# fp16 precision: recommended on V100, P100, T4, etc. to save GPU memory
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# CPU inference: needs about 32GB of RAM
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# Default GPU inference: needs about 24GB of GPU memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Set the generation hyperparameters (not needed with transformers 4.32.0 and later)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# First dialogue turn (the query asks "What is this?")
# Either a local path or a URL between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。

# Second dialogue turn (the query asks for the bounding box of the high five)
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")

4. Fine-tuning the model

a. Preparing the data

The data format is:

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  { 
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
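Samples in the format above can also be generated from code rather than written by hand. The sketch below is one way to do that with the standard library only. build_user_turn and make_sample are hypothetical helper names, and the strict user/assistant alternation check is my assumption based on the examples, not a documented requirement of the training script.

```python
import json


def build_user_turn(text: str, images=()) -> str:
    """Prefix a user message with numbered <img> tags, one per line."""
    parts = [f"Picture {i}: <img>{path}</img>" for i, path in enumerate(images, start=1)]
    parts.append(text)
    return "\n".join(parts)


def make_sample(sample_id: str, turns) -> dict:
    """Build one sample from (role, text) pairs that alternate user/assistant."""
    for index, (role, _) in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {index} should come from {expected!r}, got {role!r}")
    if len(turns) % 2 != 0:
        raise ValueError("every user turn needs an assistant reply")
    return {
        "id": sample_id,
        "conversations": [{"from": role, "value": text} for role, text in turns],
    }


question = build_user_turn(
    "图中都是哪",
    images=["assets/mm_tutorial/Chongqing.jpeg", "assets/mm_tutorial/Beijing.jpeg"],
)
answer = "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
dataset = [make_sample("identity_2", [("user", question), ("assistant", answer)])]

# ensure_ascii=False keeps the Chinese text readable instead of \u escapes.
serialized = json.dumps(dataset, ensure_ascii=False, indent=2)
print(serialized)
```

Writing serialized to a UTF-8 file gives a dataset file that can be used as the training data path.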

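A few special tokens appear here: <img> </img> wrap the image path or URL; <ref> </ref> wrap the text that a bounding box refers to; <box> </box> wrap the box position, where (x1, y1) and (x2, y2) are the top-left and bottom-right corners, normalized to the range [0, 1000).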
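Because the coordinates are normalized, they have to be scaled back to pixels before drawing or evaluating a box. The following is a minimal sketch of that conversion using a regular expression. parse_boxes is a hypothetical helper, the 2000x1000 image size is made up for the example, and integer floor division is my choice of rounding.

```python
import re

# Matches one <ref>label</ref><box>(x1,y1),(x2,y2)</box> pair.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)


def parse_boxes(response: str, width: int, height: int):
    """Return (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x1, y1, x2, y2 = (int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        pixel_box = (
            x1 * width // 1000,
            y1 * height // 1000,
            x2 * width // 1000,
            y2 * height // 1000,
        )
        # Some model outputs quote the label; strip the quotes if present.
        boxes.append((match.group("label").strip('"'), pixel_box))
    return boxes


response = "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
print(parse_boxes(response, width=2000, height=1000))
# [('格子衬衫', (1176, 499, 1450, 789))]
```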

b. Fine-tuning

I fine-tuned with LoRA. In finetune/finetune_lora_single_gpu.sh, change the DATA path to point to the JSON file of your dataset.

# Single-GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh

Unlike full-parameter fine-tuning, LoRA and Q-LoRA training only stores the adapter parameters. To use a model trained with LoRA, load it with the following code:

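from peft import AutoPeftModelForCausalLM

# Placeholder: set this to the output directory of your fine-tuning run.
path_to_adapter = "path/to/adapter"

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()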
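To get a feel for why storing only the adapter is so much cheaper, here is a rough parameter count for a single weight matrix. The sizes (a 4096x4096 matrix and rank 64) are illustrative assumptions rather than values read from the training script, and lora_param_counts is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Parameters in a full weight matrix vs. its LoRA adapter.

    The adapter consists of A (rank x d_in) and B (d_out x rank).
    """
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter


# Illustrative sizes only (assumptions, not taken from the training script).
full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=64)
print(full, adapter)
# 16777216 524288
```

With these numbers the adapter holds 1/32 of the parameters of the matrix it adapts, and the saving repeats for every layer that LoRA is applied to.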

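If loading in one step like this makes you uneasy, or gets in the way of integrating with downstream applications, you can merge and save the model first (LoRA supports merging, Q-LoRA does not) and then load the new model in the usual way. For example: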

from peft import AutoPeftModelForCausalLM

# Placeholders: set these to your adapter directory and the target directory.
path_to_adapter = "path/to/adapter"
new_model_directory = "path/to/merged_model"

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

# Fold the LoRA weights into the base model and drop the adapter wrappers.
merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are optional:
# the former shards the checkpoint, the latter saves the model as safetensors.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
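Conceptually, merging folds the low-rank update into the base weight once, so that inference no longer needs the separate adapter path. The toy NumPy sketch below checks this on a single linear layer using the standard LoRA formulation W + (alpha / r) * B @ A. It illustrates the idea and is not the peft implementation; all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 8, 6, 2, 16   # toy sizes, chosen for illustration

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(rank, d_in))        # LoRA down-projection
B = rng.normal(size=(d_out, rank))       # LoRA up-projection
scale = alpha / rank

x = rng.normal(size=(d_in,))

# Adapter kept separate: base path plus low-rank path.
y_adapter = W @ x + scale * (B @ (A @ x))

# Adapter merged: fold the low-rank update into the weight once.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
```

The merged weight has the same shape as the original one, which is why the merged model can be saved and loaded like an ordinary checkpoint.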

c. Some problems encountered during fine-tuning