First impressions: inference and fine-tuning of LLMs on domestic accelerator cards via SCNet (https://www.scnet.cn/) at 0.01 CNY/hour
Compute can be purchased on the official site; under the current promotion, domestic cards cost 0.01 CNY/hour, a very good deal.
Promotion page:
https://www.scnet.cn/home/subject/modular/index264.html
Scan the QR code to join the group and claim daily compute coupons; when joining, please put wmx_scnet in the referral field. Thanks, everyone!
Accelerator: heterogeneous AI accelerator × 1
VRAM: 64 GB
CPU: 15 cores (2× 7490, 64C)
RAM: 110 GB
Image: jupyterlab-pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
Model: Qwen
1 Environment setup
Open the notebook environment of the compute instance purchased above and set up the development environment.
Work under /public/home/wmx_scnet/ (wmx_scnet is the account username).
Clone the code:
git clone https://github.com/QwenLM/Qwen
cd Qwen
Install the dependencies:
pip install -r requirements.txt
This errors out:
…release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: transformers, transformers_stream_generator
Attempting uninstall: transformers
Found existing installation: transformers 4.38.0
Uninstalling transformers-4.38.0:
Successfully uninstalled transformers-4.38.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.3.3+das1.1.gitdf6349c.abi1.dtk2404.torch2.1.0 requires transformers>=4.38.0, but you have transformers 4.37.2 which is incompatible.
Successfully installed transformers-4.37.2 transformers_stream_generator-0.0.4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 24.1.2
Fix: upgrade pip, then rerun the install:
pip install --upgrade pip
pip install -r requirements.txt
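Before moving on, it's worth a quick sanity check that the dtk build of PyTorch actually sees the accelerator. A minimal sketch, assuming the DCU stack exposes the card through the standard torch.cuda API (which the dtk-suffixed torch and vllm builds in this image suggest):

import torch

print(torch.__version__)                  # should report 2.1.0 with a dtk suffix
print(torch.cuda.is_available())          # True if the accelerator is visible
print(torch.cuda.get_device_name(0))      # name of the heterogeneous AI accelerator
print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")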
2 Run cli_demo.py for inference
python cli_demo.py
This first downloads the Qwen-7B-Chat model; the default download path is /root/.cache/huggingface. Redirect it to $SCNET_USER_HOME/huggingface_cache:
export HUGGINGFACE_HUB_CACHE=$SCNET_USER_HOME/huggingface_cache
mv /root/.cache/huggingface $SCNET_USER_HOME/huggingface_cache
Running python cli_demo.py again doesn't pick this up: it still checks the original location instead of the directory set in HUGGINGFACE_HUB_CACHE, so the checkpoint path has to be passed by hand:
python cli_demo.py -c /public/home/wmx_scnet/huggingface_cache/hub/models--Qwen--Qwen-7B-Chat/snapshots/93a65d34827a3cc269b727e67004743b723e2f83
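A likely cause (my guess, not confirmed in this post): huggingface_hub reads HUGGINGFACE_HUB_CACHE once at import time, so an export that doesn't reach the process launching Python, or one made after the libraries are imported, has no effect. Setting the variable in Python before any transformers import is one way to make it stick:

import os
# must run before transformers / huggingface_hub are imported
os.environ["HUGGINGFACE_HUB_CACHE"] = os.path.expandvars("$SCNET_USER_HOME/huggingface_cache")

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()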
On the first run, the model spouted gibberish:
python cli_demo.py
User: 你是谁 ("Who are you?")
Qwen-Chat: "门不吃禁还要不乘没有了息之失。
。饿逊是谁。恒江是你。你是一只恒江。汉江是我。汉江是我。你好江,你好江。你是一个好江。好的江你好江。好的江你好江。你好江。你好江。你好江。你好江。好江你好江。你好江。好江你好江。你好江。你好江。好的江你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好
The domestic card comes with hy-smi, a command analogous to nvidia-smi. It showed VRAM usage at 47%, roughly 30 GB.
I reported the gibberish to support, and it was fixed afterwards.
Running cli_demo.py again, the model answered questions correctly, but very slowly.
After shutting it down and rerunning, the GPU memory had not been released: hy-smi showed VRAM usage at 97%, so the instance had to be restarted.
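If the leaking Python process is still reachable, it may be possible to reclaim VRAM without restarting the instance. A sketch, not verified on the DCU stack, assuming model holds the checkpoint loaded earlier:

import gc
import torch

del model                 # drop the last reference to the loaded model
gc.collect()              # let Python actually free the tensors
torch.cuda.empty_cache()  # hand cached blocks back to the driver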
python cli_demo.py -c /public/home/wmx_scnet/huggingface_cache/hub/models--Qwen--Qwen-7B-Chat/snapshots/93a65d34827a3cc269b727e67004743b723e2f83
Your device support faster inference by passing bf16=True in "AutoModelForCausalLM.from_pretrained".
/opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py:1384: UserWarning: Current model requires 4194560 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
libgomp: Thread creation failed: Resource temporarily unavailable
3 Fine-tuning the Qwen model:
Upload the model files with e-file to /public/home/wmx_scnet/Qwen-1_8B-Chat,
and the training data to /public/home/wmx_scnet/DISC-Law-SFT/train_data_law.json; the expected data shape is sketched below.
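For reference, finetune.py in the Qwen repo expects the training JSON to be a list of conversations, so train_data_law.json has to follow the same shape (the field values below are made up for illustration, not taken from the actual DISC-Law-SFT data):

[
  {
    "id": "identity_0",
    "conversations": [
      { "from": "user", "value": "What is the general limitation period for civil claims?" },
      { "from": "assistant", "value": "Under the Civil Code it is three years, counted from the day the right holder knew or should have known of the harm." }
    ]
  }
]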
Edit Qwen/finetune/finetune_lora_single_gpu.sh to point at the model and the data:
MODEL="/public/home/wmx_scnet/Qwen-1_8B-Chat"
DATA="/public/home/wmx_scnet/DISC-Law-SFT/train_data_law.json"
Then adjust the fine-tuning arguments in finetune_lora_single_gpu.sh:
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 512 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
Launch the fine-tuning run:
sh finetune/finetune_lora_single_gpu.sh
The initial estimate was about 7 hours: 310 iterations in total at roughly 76 s per iteration (310 × 76 s ≈ 6.5 h).
Screenshot from mid-run (hy-smi showed VRAM usage at 65%):
Once fine-tuning finished, the model needs to be merged.
The LoRA checkpoint is written to /public/home/wmx_scnet/Qwen/output_qwen/checkpoint-200.
Merging is done with the qwen_lora_merge.py script; the merged model lands in /public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge.
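The qwen_lora_merge.py script itself isn't shown in this post, but the Qwen repo documents the merge as loading the adapter with peft and folding it into the base weights, roughly like this (paths match the ones above):

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "/public/home/wmx_scnet/Qwen/output_qwen/checkpoint-200"
merged_dir = "/public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge"

# load base model + LoRA adapter, then fold the deltas into the base weights
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir, device_map="auto", trust_remote_code=True
).eval()
merged = model.merge_and_unload()
merged.save_pretrained(merged_dir, max_shard_size="2048MB", safe_serialization=True)

# save the tokenizer next to the weights so cli_demo.py can load the directory directly
tokenizer = AutoTokenizer.from_pretrained(adapter_dir, trust_remote_code=True)
tokenizer.save_pretrained(merged_dir)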
Run inference with the merged model:
python cli_demo.py -c /public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge
Output:
I modified cli_demo.py to pass fp32=False:
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map=device_map,
fp32=False,
trust_remote_code=True,
resume_download=False,
).eval()
This was because training used bf16, and the model's config.json sets bf16=true. But with this change, inference produced garbled output; switching the model to fp32 made inference work normally.
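So the load call that finally worked on this card looks like the following; fp32/fp16/bf16 are Qwen-specific flags that from_pretrained forwards into the model config:

model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint_path,
    device_map=device_map,
    fp32=True,   # bf16 inference produced garbled output on the domestic card
    trust_remote_code=True,
).eval()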
The fine-tuned model answering questions from the legal data:
Summary:
Fine-tuning on the domestic card is very slow, taking about 7 hours; an NVIDIA 4070 Ti SUPER needs roughly 2.5 hours, a big gap.
4070 Ti SUPER fine-tuning arguments:
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 False \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 512 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
Screenshot of the 4070 Ti SUPER fine-tuning run below: speed was 7.32 s/it. (With a per-device batch size of 8 instead of 32, the run takes about four times as many steps, roughly 1240, and 1240 × 7.32 s ≈ 2.5 h, consistent with the total above.)
4070 Ti SUPER GPU memory usage (screenshot below): only 5.28 GB.
Inference on the 4070 Ti SUPER with the fine-tuned model: GPU memory usage is only 4.67 GB.
For comparison, the inference speed on the domestic card is shown in the screenshot below.
Compared with NVIDIA, the domestic card is already doing quite well; building this up from nothing is impressive, and the service is good: technical support is online 24 hours a day, and problems get resolved quickly.
Performance still needs to improve, as do the software and the ecosystem, and that depends on growing the user base. The compute is extremely cheap and well worth using day to day for fine-tuning, inference, and training LLMs. If you have specific hardware requirements, the site also offers NVIDIA-series cards at good prices.