Deploying Qwen-1.8B-Chat for Free on Alibaba Cloud and Fine-Tuning It with LoRA: A Hands-On Guide from Zero to One

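Contents

I. Requesting Resources
II. Creating an Instance
III. Cloning the Fine-Tuning Data
IV. Deploying the Qwen-1.8B-Chat Model
- 1. Environment Setup
- 2. Downloading the Model
- 3. Deploying the Model Locally
V. Fine-Tuning the Model
- 1. Cloning the Qwen Repository
- 2. Fine-Tuning Configuration
- 3. Merging the Fine-Tuned Parameters
- 4. Deploying the Fine-Tuned Model Locally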

I. Requesting Resources

I already wrote a detailed tutorial on requesting PAI resources with an Alibaba Cloud account when I deployed ChatGLM3: https://blog.csdn.net/Yaki_Duck/article/details/142101802?fromshare=blogdetail&sharetype=blogdetail&sharerId=142101802&sharerefer=PC&sharesource=Yaki_Duck&sharefrom=from_link

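II. Creating an Instance

From the resources claimed above, or via [Console] -> [Interactive Modeling (DSW)], open the instance creation page.

If you want a no-thinking setup, copy my choices for the resource specification and the image:
Resource specification: ecs.gn7i-c8g1.2xlarge (8 vCPU, 30 GiB, NVIDIA A10 * 1)
Image: modelscope:1.11.0-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04

Click OK to finish creating the instance.
Then go back to the console, start and open the new instance, and create a new notebook (a file ending in .ipynb).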

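III. Cloning the Fine-Tuning Data

Data repository: https://github.com/52phm/qwen_1_8chat_finetune?tab=readme-ov-file
Clone this repository into /mnt/workspace (for example with git clone from a notebook cell or a terminal). The fine-tuning command later in this article expects the data at /mnt/workspace/qwen_1_8chat_finetune/qwen_chat.json.
Data files:

qwen_chat.json (small dataset)
chat.json (medium-sized dataset)

Sample record: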

[
    {
        "id": "identity_0",
        "conversations": [
            {
                "from": "user",
                "value": "识别以下句子中的地址信息，并按照{address:['地址']}的格式返回。如果没有地址，返回{address:[]}。句子为：在一本关于人文的杂志中，我们发现了一篇介绍北京市海淀区科学院南路76号社区服务中心一层的文章，文章深入探讨了该地点的人文历史背景以及其对于当地居民的影响。"
            },
            {
                "from": "assistant",
                "value": "{\"address\":\"北京市海淀区科学院南路76号社区服务中心一层\"}"
            }
        ]
    }
]
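Before training, it can help to confirm that a data file follows the structure of the sample record. The following is a minimal sketch, not part of the original tutorial: the function name validate_records is made up for illustration, and the rule that turns alternate between "user" and "assistant" is an assumption based on the sample above.

```python
def validate_records(records):
    """Return a list of problems; an empty list means the data looks well-formed."""
    problems = []
    for i, rec in enumerate(records):
        if "id" not in rec:
            problems.append(f"record {i}: missing 'id'")
        turns = rec.get("conversations", [])
        if not turns:
            problems.append(f"record {i}: no conversations")
        for j, turn in enumerate(turns):
            # Assumption: turns alternate user, assistant, user, ... as in the sample.
            expected = "user" if j % 2 == 0 else "assistant"
            if turn.get("from") != expected:
                problems.append(f"record {i}, turn {j}: expected from={expected!r}")
            if not isinstance(turn.get("value"), str):
                problems.append(f"record {i}, turn {j}: 'value' must be a string")
    return problems


# A tiny in-memory record with the same shape as the sample above.
sample = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "example question"},
            {"from": "assistant", "value": '{"address":"example"}'},
        ],
    }
]
print(validate_records(sample))  # []
```

To check a real file, load it with json.load and pass the resulting list to validate_records.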

IV. Deploying the Qwen-1.8B-Chat Model

1. Environment Setup

First install the required packages and libraries:

!pip install deepspeed transformers==4.32.0 peft pydantic==1.10.13 transformers_stream_generator einops tiktoken modelscope

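2. Downloading the Model

In the JupyterLab of an Alibaba ModelScope notebook, downloaded models are cached under /mnt/workspace/.cache/modelscope/. On a local machine the cache usually ends up on your C: drive or in your user directory, so check where the model lands in your own setup. You can also find the location in the log output, for example: 2024-03-16 16:30:54,106 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer.

The cell below downloads the model from the ModelScope hub with snapshot_download and then lists the downloaded files with ls. The model is stored at /mnt/workspace/.cache/modelscope/qwen/Qwen-1_8B-Chat/.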

%%time
from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen-1_8B-Chat')
!ls /mnt/workspace/.cache/modelscope/qwen/Qwen-1_8B-Chat/

3. Deploying the Model Locally

%%time
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig 


query = "识别以下句子中的地址信息，并按照{address:['地址']}的格式返回。如果没有地址，返回{address:[]}。句子为：在一本关于人文的杂志中，我们发现了一篇介绍北京市海淀区科学院南路76号社区服务中心一层的文章，文章深入探讨了该地点的人文历史背景以及其对于当地居民的影响。"
local_model_path = "/mnt/workspace/.cache/modelscope/qwen/Qwen-1_8B-Chat/"
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_model_path, device_map="auto", trust_remote_code=True).eval()
response, history = model.chat(tokenizer, query, history=None)
print("回答如下:\n", response)

Output:

The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.11it/s]


回答如下:
 在这个句子中，有三个地址信息：
1. 北京市海淀区科学院南路76号社区服务中心一层。
2. 文章深入探讨了该地点的人文历史背景以及其对于当地居民的影响。

按照{address:['地址']}的格式返回：
在一本关于人文的杂志中，我们发现了一篇介绍北京市海淀区科学院南路76号社区服务中心一层的文章，文章深入探讨了该地点的人文历史背景以及其对于当地居民的影响。
CPU times: user 3.51 s, sys: 280 ms, total: 3.79 s
Wall time: 3.79 s

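The base model clearly did not understand the instruction. It claims the sentence contains three addresses, lists the real address together with an unrelated clause, and then echoes the input sentence instead of returning the requested {address: [...]} structure. Since this is not the answer we want, the next step is to fine-tune the model.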
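One way to make "did the model follow the format" checkable is to try parsing the reply as JSON. The sketch below is an illustration only: extract_address is a hypothetical helper, and the target format is taken from the assistant turn in the dataset sample shown earlier (a JSON object with an "address" key).

```python
import json


def extract_address(response):
    """Return the 'address' field if the reply is a JSON object that has one, else None."""
    try:
        parsed = json.loads(response.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, dict) and "address" in parsed:
        return parsed["address"]
    return None


# First lines of the base model's reply shown above: free text, not JSON.
base_reply = "在这个句子中，有三个地址信息：\n1. 北京市海淀区科学院南路76号社区服务中心一层。"
# The assistant turn from the dataset sample: the format we are aiming for.
target_reply = '{"address":"北京市海淀区科学院南路76号社区服务中心一层"}'

print(extract_address(base_reply))    # None
print(extract_address(target_reply))  # 北京市海淀区科学院南路76号社区服务中心一层
```

Running the same check over a held-out slice of the data before and after fine-tuning would give a rough measure of format compliance.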

V. Fine-Tuning the Model

1. Cloning the Qwen Repository

!git clone https://github.com/QwenLM/Qwen.git

2. Fine-Tuning Configuration

This walkthrough fine-tunes the model with LoRA. Both configuration and training go through the Qwen/finetune.py script.
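To get a feel for why LoRA is cheap, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains two small factors, B (d_out x r) and A (r x d_in). The sketch below uses made-up sizes purely for illustration; they are not read from the Qwen configuration or from finetune.py.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Number of parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only to make the arithmetic concrete.
d_out, d_in, rank = 2048, 2048, 64

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")  # 4194304 262144 6.25%
```

With these numbers the adapter trains about one sixteenth of the parameters the full matrix would need, and the ratio shrinks further as the rank decreases.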

Parameter settings:

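--model_name_or_path Qwen-1_8B-Chat: name or path of the pretrained model; here the Qwen-1_8B-Chat model.
--data_path chat.json: path to the training data; here a file named chat.json.
--fp16 True: whether to train in half precision (float16). True in this example; the command in this article passes False.
--output_dir output_qwen: output directory; training results are saved to a folder named output_qwen.
--num_train_epochs 5: number of training epochs. 5 in this example; the command in this article uses 10.
--per_device_train_batch_size 2: training batch size per device (for example per GPU); here 2 samples per device.
--per_device_eval_batch_size 1: evaluation batch size per device; here 1 sample per device.
--gradient_accumulation_steps 8: number of gradient accumulation steps; parameters are updated once every 8 accumulated batches.
--evaluation_strategy "no": evaluation strategy; here no evaluation is run during training.
--save_strategy "steps": checkpoint strategy; a checkpoint is saved every fixed number of steps, set by --save_steps.
--save_steps 1000: save a checkpoint every 1000 steps.
--save_total_limit 10: keep at most 10 checkpoints.
--learning_rate 3e-4: learning rate, here 3e-4.
--weight_decay 0.1: weight decay coefficient, here 0.1.
--adam_beta2 0.95: beta2 parameter of the Adam optimizer, here 0.95.
--warmup_ratio 0.01: warmup ratio; the learning rate warms up over the first 1% of the total steps.
--lr_scheduler_type "cosine": learning-rate scheduler type, here cosine decay.
--logging_steps 1: logging interval; here a log entry is written every step.
--report_to "none": reporting target; here nothing is sent to any experiment tracker.
--model_max_length 512: maximum input sequence length, here 512 tokens.
--lazy_preprocess True: whether to preprocess the data lazily; set to True here.
--gradient_checkpointing: enables gradient checkpointing, which saves GPU memory by recomputing activations during the backward pass, at the cost of somewhat slower training.
--use_lora: whether to use LoRA (Low-Rank Adaptation); set to True here.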
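Several of these settings interact: batch size and gradient accumulation determine the effective batch, which in turn fixes how many optimizer steps an epoch takes, how long warmup lasts, and whether --save_steps is ever reached. The following is a rough sketch of that arithmetic and of a warmup-plus-cosine schedule. The dataset size of 1000 examples is invented for illustration, and the step counting only approximates what the training library does internally.

```python
import math


def training_plan(num_examples, per_device_batch, grad_accum, epochs,
                  warmup_ratio, num_devices=1):
    """Approximate effective batch size, total optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 1000 examples is a hypothetical dataset size; the other values come from the list above.
effective_batch, total_steps, warmup_steps = training_plan(1000, 2, 8, 5, 0.01)
print(effective_batch, total_steps, warmup_steps)  # 16 310 4
print(cosine_lr(warmup_steps, total_steps, warmup_steps, 3e-4))  # 0.0003 (peak)
```

Under these assumed numbers the run has 310 optimizer steps in total, which is below --save_steps 1000, so the periodic step-based checkpointing would never trigger. For small datasets it may be worth lowering --save_steps.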

Fine-tuning command (note: in --data_path /mnt/workspace/qwen_1_8chat_finetune/qwen_chat.json, substitute the location where you stored the downloaded data):

%%time
!python ./Qwen/finetune.py \
--model_name_or_path "/mnt/workspace/.cache/modelscope/qwen/Qwen-1_8B-Chat/" \
--data_path /mnt/workspace/qwen_1_8chat_finetune/qwen_chat.json \
--fp16 False \
--output_dir output_qwen \
--num_train_epochs 10 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 512 \
--lazy_preprocess True \
--gradient_checkpointing True \
--use_lora True

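3. Merging the Fine-Tuned Parameters

Unlike full-parameter fine-tuning, LoRA and Q-LoRA training only stores the adapter parameters. After training with LoRA, you can first merge the adapter into the base model and save the result (LoRA supports merging, Q-LoRA does not), then load the new model in the usual way.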

%%time
from peft import AutoPeftModelForCausalLM 
from transformers import AutoTokenizer 

# Tokenizer: load it from the adapter output directory and save it next to the merged model
tokenizer = AutoTokenizer.from_pretrained("output_qwen", trust_remote_code=True)
tokenizer.save_pretrained("qwen-1_8b-finetune")

# Model: load the base model plus the LoRA adapter, then fold the adapter into the base weights
model = AutoPeftModelForCausalLM.from_pretrained("output_qwen", device_map="auto", trust_remote_code=True).eval()
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen-1_8b-finetune", max_shard_size="2048MB", safe_serialization=True)  # shards of at most 2 GB each
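Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / r) * B A, so the merged model produces the same outputs without carrying a separate adapter. The toy example below illustrates that identity in plain Python with invented 2x2 numbers. It is a conceptual sketch only and does not reflect how peft implements merge_and_unload.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
B = [[0.5], [-1.0]]           # LoRA factor B (d_out x r)
A = [[2.0, 0.0]]              # LoRA factor A (r x d_in)
alpha, r = 2.0, 1
scale = alpha / r
x = [[1.0], [1.0]]            # input as a column vector

# Adapter kept separate: base output plus scaled low-rank correction.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged: fold the update into the weight once, then use a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)  # [[5.0], [3.0]] [[5.0], [3.0]]
```

Both paths give the same result, which is why the merged checkpoint can be loaded afterwards like an ordinary model.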

4. Deploying the Fine-Tuned Model Locally

Load the fine-tuned, merged model and run it locally.

%%time
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig 


query = "识别以下句子中的地址信息，并按照{address:['地址']}的格式返回。如果没有地址，返回{address:[]}。句子为：在一本关于人文的杂志中，我们发现了一篇介绍北京市海淀区科学院南路76号社区服务中心一层的文章，文章深入探讨了该地点的人文历史背景以及其对于当地居民的影响。"
local_model_path = "qwen-1_8b-finetune"
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_model_path, device_map="auto", trust_remote_code=True).eval()
response, history = model.chat(tokenizer, query, history=None)
print("回答如下:\n", response)

Output:

Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.03it/s]


回答如下:
 {"address":"北京市海淀区科学院南路76号社区服务中心一层"}
CPU times: user 1.66 s, sys: 269 ms, total: 1.93 s
Wall time: 1.93 s

After fine-tuning, the model clearly understands the instruction and extracts exactly the information we asked for.

Reference: https://blog.csdn.net/qq_41731978/article/details/136766174?fromshare=blogdetail&sharetype=blogdetail&sharerId=136766174&sharerefer=PC&sharesource=Yaki_Duck&sharefrom=from_link