Fine-tuning QwQ-32B (4-bit) with unsloth
GPU: RTX 3090, 24 GB
Installing and setting up unsloth
-
Install with pip
pip install unsloth --index-url https://pypi.mirrors.ustc.edu.cn/simple
# Optional: enable the cloud provider's network acceleration, then install the latest unsloth from GitHub
source /etc/network_turbo
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
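A quick sanity check (a minimal sketch; the exact version string and device name will vary per machine) to confirm that unsloth imports cleanly and the 3090 is visible to PyTorch:
import unsloth  # import check only
import torch

print(torch.__version__)
print(torch.cuda.is_available())        # expect True
print(torch.cuda.get_device_name(0))    # expect the RTX 3090
print(f"{torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")  # roughly 24 GB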
Register a Wandb account to monitor the fine-tuning run
-
wandb site
https://wandb.ai/site
-
Log in on the site
Install the package
pip install wandb
Log in with your API key
wandb login
-
Try the official quick-start example
Notes:
- Requires internet access
- Replace the key with your own
- The entity must be created in advance
import random

import wandb

wandb.login(key="api-key")

# Start a new wandb run to track this script.
run = wandb.init(
    # Set the wandb entity where your project will be logged (generally your team name).
    entity="qinchihongye-pa",
    # Set the wandb project where this run will be logged.
    project="project_test",
    # Track hyperparameters and run metadata.
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    },
)

# Simulate training.
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2**-epoch - random.random() / epoch - offset
    loss = 2**-epoch + random.random() / epoch + offset
    # Log metrics to wandb.
    run.log({"acc": acc, "loss": loss})

# Finish the run and upload any remaining data.
run.finish()
Download the quantized QwQ-32B model
-
Hugging Face page (unsloth's 4-bit quantization, which loses less accuracy than a Q4_K_M quant)
https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit
Copy the model name
unsloth/QwQ-32B-unsloth-bnb-4bit
-
Assume the current directory is
/root/lanyun-tmp
-
Create a folder to keep all models downloaded from Hugging Face in one place
mkdir Hugging-Face
mkdir -p Hugging-Face/QwQ-32B-unsloth-bnb-4bit
-
Configure the mirror
vim ~/.bashrc
Add the following two lines to switch the Hugging Face endpoint to a mirror and change the default directory where models are saved
export HF_ENDPOINT=https://hf-mirror.com
export HF_HOME=/root/lanyun-tmp/Hugging-Face
Reload and check that the environment variables took effect
source ~/.bashrc
echo $HF_ENDPOINT
echo $HF_HOME
-
Install the official Hugging Face download tool
pip install -U huggingface_hub
-
Run the model download command
huggingface-cli download --resume-download unsloth/QwQ-32B-unsloth-bnb-4bit \
    --local-dir /root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit
Or download with Python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/QwQ-32B-unsloth-bnb-4bit",
    local_dir="/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit",
)
Inference example with the transformers library
-
Code
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "你好"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
-
GPU memory usage: about 23 GB.
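To check that figure from the same Python session, a minimal sketch (numbers from PyTorch's caching allocator will differ slightly from what nvidia-smi reports):
import torch

# Memory held by tensors vs. memory reserved by the CUDA caching allocator
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.1f} GB")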
vLLM serving example
-
Start the server
cd /root/lanyun-tmp/Hugging-Face
vllm serve ./QwQ-32B-unsloth-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 500 \
    --port 8081
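Once the server is up, a quick check (a sketch; it assumes the server is reachable on 127.0.0.1:8081 and no API key is enforced, so any placeholder key works) shows the model id vLLM is serving, which is the path given to vllm serve and is what the client must pass as model:
from openai import OpenAI

client = OpenAI(api_key="1111111", base_url="http://127.0.0.1:8081/v1")
# List the models exposed by the OpenAI-compatible endpoint
for m in client.models.list():
    print(m.id)  # expected: ./QwQ-32B-unsloth-bnb-4bit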
-
Client code
from openai import OpenAI
import openai

openai.api_key = '1111111'  # any placeholder value works here
openai.base_url = 'http://127.0.0.1:8081/v1'

def get_completion(prompt, model="QwQ-32B"):
    client = OpenAI(api_key=openai.api_key, base_url=openai.base_url)
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=False
    )
    return response.choices[0].message.content

prompt = '你好,请幽默的介绍下你自己,不少于300字'
get_completion(prompt, model="./QwQ-32B-unsloth-bnb-4bit")
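Since QwQ's chain-of-thought replies can be long, streaming is often more comfortable; a sketch against the same endpoint (stream=True is standard in the OpenAI client and is supported by vLLM's OpenAI-compatible server):
from openai import OpenAI

client = OpenAI(api_key="1111111", base_url="http://127.0.0.1:8081/v1")
stream = client.chat.completions.create(
    model="./QwQ-32B-unsloth-bnb-4bit",
    messages=[{"role": "user", "content": "你好,请简单介绍一下你自己"}],
    stream=True,
)
# Print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)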
CoT dataset
-
FreedomIntelligence/medical-o1-reasoning-SFT
https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
-
Download the English split
from datasets import load_dataset
import rich

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
rich.print(ds['train'][0])
-
Download the Chinese split
from datasets import load_dataset
import rich

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "zh")
rich.print(ds['train'][0])
-
After the download finishes, the data appears in the datasets directory under the Hugging-Face directory
ll /root/lanyun-tmp/Hugging-Face/datasets/
Loading QwQ-32B with unsloth
-
unsloth can load the model directly for inference; load it first
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True  # load in 4-bit

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit/",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
GPU memory usage: about 22 GB
-
Inference
# Switch the model to inference mode
FastLanguageModel.for_inference(model)

def QwQ32b_infer(question):
    # Prompt template
    prompt_style_chat = """请写出一个恰当的回复来完成当前对话任务。

### Instruction:
你是一名助人为乐的助手。

### Question:
{}

### Response:
<think>{}"""

    inputs = tokenizer(
        [prompt_style_chat.format(question, "")],
        return_tensors="pt"
    ).to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        max_new_tokens=2048,
        use_cache=True,
    )
    response = tokenizer.batch_decode(outputs)
    return response[0].split("### Response:")[1]

question = "证明根号2是无理数"
response = QwQ32b_infer(question)
Model fine-tuning
-
Test: try two questions from the fine-tuning dataset first
question_1 = "根据描述,一个1岁的孩子在夏季头皮出现多处小结节,长期不愈合,且现在疮大如梅,溃破流脓,口不收敛,头皮下有空洞,患处皮肤增厚。这种病症在中医中诊断为什么病?"
question_2 = "一个生后8天的男婴因皮肤黄染伴发热和拒乳入院。体检发现其皮肤明显黄染,肝脾肿大和脐部少量渗液伴脐周红肿。在此情况下,哪种检查方法最有助于确诊感染病因?"

response_1 = QwQ32b_infer(question_1)
response_2 = QwQ32b_infer(question_2)

print(response_1)
print(response_2)
-
Load and process the data; take the first 500 training examples for a minimal feasibility experiment
import os
from datasets import load_dataset

# Q&A prompt template
train_prompt_style = """下面是描述任务的指令,与提供进一步上下文的输入配对。编写适当完成请求的响应。在回答之前,仔细思考问题,并创建逐步的思想链,以确保逻辑和准确的响应。

### Instruction:
您是一位在临床推理、诊断和治疗计划方面拥有先进知识的医学专家。请回答以下医学问题。

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

# End-of-sequence token appended to every training text
EOS_TOKEN = tokenizer.eos_token  # '<|im_end|>'

# Map function that turns each example into a single training text
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

# Take only the first 500 training examples
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "zh",
    split="train[0:500]",
    trust_remote_code=True,
)
dataset = dataset.map(formatting_prompts_func, batched=True)

import rich
rich.print(dataset[0])
rich.print(dataset[0]['text'])
-
Put the model into fine-tuning mode
# Put the model into fine-tuning mode (attach LoRA adapters)
model = FastLanguageModel.get_peft_model(
    model,
    r=4,  # rank of the low-rank matrices (r=16 is also common)
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=1024,
    use_rslora=False,
    loftq_config=None,
)
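To see how small the trainable fraction is with r=4, a quick check (a sketch; it assumes get_peft_model returns a standard PEFT-wrapped model, which exposes print_trainable_parameters):
# Summarize trainable (LoRA) parameters vs. total parameters
model.print_trainable_parameters()

# Equivalent manual count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")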
-
Create the trainer (supervised fine-tuning object)
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,                        # pretrained model to fine-tune
    tokenizer=tokenizer,                # tokenizer
    train_dataset=dataset,              # training data
    dataset_text_field="text",          # dataset column holding the training text (built in formatting_prompts_func)
    max_seq_length=max_seq_length,      # maximum sequence length, caps the number of input tokens
    dataset_num_proc=2,                 # number of parallel processes for data loading
    args=TrainingArguments(
        per_device_train_batch_size=1,  # training batch size per GPU/device (small values suit large models)
        gradient_accumulation_steps=4,  # gradient accumulation steps, effective batch size = 1 * 4 = 4
        # num_train_epochs=1,           # note: a positive max_steps overrides num_train_epochs
        warmup_steps=5,                 # warmup steps: start with a low learning rate, then ramp up
        max_steps=60,                   # maximum number of training steps
        learning_rate=2e-4,             # learning rate
        fp16=not is_bfloat16_supported(),  # fall back to fp16 if the GPU does not support bfloat16
        bf16=is_bfloat16_supported(),      # use bf16 when supported (more stable training)
        logging_steps=10,               # log every 10 steps
        optim="adamw_8bit",             # 8-bit AdamW optimizer to reduce GPU memory usage
        weight_decay=0.01,              # weight decay (L2 regularization) against overfitting
        lr_scheduler_type="linear",     # linear learning-rate decay
        seed=1024,                      # random seed for reproducibility
        output_dir="/root/lanyun-tmp/outputs",  # output directory for training results
    ),
)

# Set up wandb (optional)
import wandb
wandb.login(key="api-key")
run = wandb.init(
    entity="qinchihongye-pa",
    project='QwQ-32B-4bit-FT',
)

# Start fine-tuning
trainer_stats = trainer.train()
trainer_stats
GPU memory usage during training is as shown above; the training progress is shown below.
Click the wandb link to watch the loss, learning rate, gradients, and other metrics evolve during training.
-
After fine-tuning ends, unsloth automatically updates the model weights in memory, so there is no need to merge manually; the fine-tuned model can be called directly
FastLanguageModel.for_inference(model)

new_response_1 = QwQ32b_infer(question_1)
new_response_2 = QwQ32b_infer(question_2)

new_response_1
new_response_2
The first question is still answered incorrectly and the second is unchanged as well; consider a larger-scale fine-tune using the full dataset and multiple epochs.
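A sketch of what that larger run could look like (hypothetical values, not verified on this 3090; it reuses formatting_prompts_func and keeps the other TrainingArguments from above): load the full zh split instead of train[0:500], and drive training by epochs rather than max_steps, since a positive max_steps takes precedence over num_train_epochs.
from datasets import load_dataset
from transformers import TrainingArguments

# Full Chinese split instead of the first 500 examples
full_dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "zh",
    split="train",
    trust_remote_code=True,
).map(formatting_prompts_func, batched=True)

# Epoch-driven TrainingArguments (other settings as in the trainer above)
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,   # hypothetical epoch count
    max_steps=-1,         # -1 disables the step cap so num_train_epochs takes effect
    learning_rate=2e-4,
    optim="adamw_8bit",
    output_dir="/root/lanyun-tmp/outputs",
)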
-
Model merging
At this point the locally saved model weights are under
/root/lanyun-tmp/outputs
Note: by default unsloth saves a checkpoint every 100 steps; since max_steps=60 here, there is only one checkpoint.
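If a longer run gets interrupted, it can be resumed from the newest checkpoint under output_dir (a sketch; resume_from_checkpoint is the standard transformers Trainer argument and is assumed to pass through SFTTrainer unchanged):
# Resume from the most recent checkpoint in output_dir
trainer_stats = trainer.train(resume_from_checkpoint=True)

# Or point at a specific checkpoint directory (hypothetical path)
# trainer_stats = trainer.train(resume_from_checkpoint="/root/lanyun-tmp/outputs/checkpoint-60")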
Merge and save as safetensors
model.save_pretrained_merged(
    "/root/lanyun-tmp/QwQ-Medical-COT-Tiny",
    tokenizer,
    save_method="merged_4bit_forced",  # save as 4-bit quantized
)
# model.save_pretrained_merged(
#     "dir",
#     tokenizer,
#     save_method="merged_16bit",      # save as 16-bit
# )
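To double-check the merged export, it can be reloaded the same way the original quantized model was loaded (a minimal sketch, assuming the merged 4-bit directory saved above):
from unsloth import FastLanguageModel

merged_model, merged_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/lanyun-tmp/QwQ-Medical-COT-Tiny",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(merged_model)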
Merge to GGUF format (requires quantization, very time-consuming)
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="q4_k_m")
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="q8_0")
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="f16")