First impressions: inference and fine-tuning of LLMs on domestic accelerator cards via SCNet (https://www.scnet.cn/) at 0.01 CNY/hour
Compute can be purchased on the official site; under the current promotion, domestic cards cost 0.01 CNY/hour, a very good deal.
Promotion page:
https://www.scnet.cn/home/subject/modular/index264.html
Scan the QR code to join the group and claim daily compute coupons; when joining, please put wmx_scnet in the referral field. Thanks, everyone!
Accelerator: heterogeneous AI accelerator × 1
VRAM: 64 GB
CPU: 15 cores (2× 7490, 64C)
RAM: 110 GB
Image: jupyterlab-pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
Model: Qwen
1 Environment setup
Open the notebook environment of the compute instance purchased above and set up the development environment.
Work under /public/home/wmx_scnet/ (wmx_scnet is the account username).
Clone the code:
git clone https://github.com/QwenLM/Qwen
cd Qwen
Install the dependencies:
pip install -r requirements.txt
This errors out:
…release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: transformers, transformers_stream_generator
Attempting uninstall: transformers
Found existing installation: transformers 4.38.0
Uninstalling transformers-4.38.0:
Successfully uninstalled transformers-4.38.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.3.3+das1.1.gitdf6349c.abi1.dtk2404.torch2.1.0 requires transformers>=4.38.0, but you have transformers 4.37.2 which is incompatible.
Successfully installed transformers-4.37.2 transformers_stream_generator-0.0.4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 24.1.2
Fix: upgrade pip, then rerun the install:
pip install --upgrade pip
pip install -r requirements.txt
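Before moving on, it's worth a quick sanity check that the dtk build of PyTorch actually sees the accelerator. A minimal sketch, assuming the DCU stack exposes the card through the standard torch.cuda API (which the dtk-suffixed torch and vllm builds in this image suggest):

import torch

print(torch.__version__)                  # should report 2.1.0 with a dtk suffix
print(torch.cuda.is_available())          # True if the accelerator is visible
print(torch.cuda.get_device_name(0))      # name of the heterogeneous AI accelerator
print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")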
2 Run cli_demo.py for inference
python cli_demo.py
This first downloads the Qwen-7B-Chat model; the default download path is /root/.cache/huggingface. Redirect it to $SCNET_USER_HOME/huggingface_cache:
export HUGGINGFACE_HUB_CACHE=$SCNET_USER_HOME/huggingface_cache
mv /root/.cache/huggingface $SCNET_USER_HOME/huggingface_cache
Running python cli_demo.py again doesn't pick this up: it still checks the original location instead of the directory set in HUGGINGFACE_HUB_CACHE, so the checkpoint path has to be passed by hand:
python cli_demo.py -c /public/home/wmx_scnet/huggingface_cache/hub/models--Qwen--Qwen-7B-Chat/snapshots/93a65d34827a3cc269b727e67004743b723e2f83
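A likely cause (my guess, not confirmed in this post): huggingface_hub reads HUGGINGFACE_HUB_CACHE once at import time, so an export that doesn't reach the process launching Python, or one made after the libraries are imported, has no effect. Setting the variable in Python before any transformers import is one way to make it stick:

import os
# must run before transformers / huggingface_hub are imported
os.environ["HUGGINGFACE_HUB_CACHE"] = os.path.expandvars("$SCNET_USER_HOME/huggingface_cache")

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()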
On the first run, the model spouted gibberish:
python cli_demo.py
User: 你是谁 ("Who are you?")
Qwen-Chat: "门不吃禁还要不乘没有了息之失。
。饿逊是谁。恒江是你。你是一只恒江。汉江是我。汉江是我。你好江,你好江。你是一个好江。好的江你好江。好的江你好江。你好江。你好江。你好江。你好江。好江你好江。你好江。好江你好江。你好江。你好江。好的江你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好江。你好
The domestic card comes with hy-smi, a command analogous to nvidia-smi. It showed VRAM usage at 47%, roughly 30 GB.
I reported the gibberish to support, and it was fixed afterwards.
Running cli_demo.py again, the model answered questions correctly, but very slowly.
After shutting it down and rerunning, the GPU memory had not been released: hy-smi showed VRAM usage at 97%, so the instance had to be restarted.
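If the leaking Python process is still reachable, it may be possible to reclaim VRAM without restarting the instance. A sketch, not verified on the DCU stack, assuming model holds the checkpoint loaded earlier:

import gc
import torch

del model                 # drop the last reference to the loaded model
gc.collect()              # let Python actually free the tensors
torch.cuda.empty_cache()  # hand cached blocks back to the driver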
python cli_demo.py -c /public/home/wmx_scnet/huggingface_cache/hub/models--Qwen--Qwen-7B-Chat/snapshots/93a65d34827a3cc269b727e67004743b723e2f83
Your device support faster inference by passing bf16=True in "AutoModelForCausalLM.from_pretrained".
/opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py:1384: UserWarning: Current model requires 4194560 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
libgomp: Thread creation failed: Resource temporarily unavailable
3 Fine-tuning the Qwen model:
Upload the model files with e-file to /public/home/wmx_scnet/Qwen-1_8B-Chat,
and the training data to /public/home/wmx_scnet/DISC-Law-SFT/train_data_law.json; the expected data shape is sketched below.
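For reference, finetune.py in the Qwen repo expects the training JSON to be a list of conversations, so train_data_law.json has to follow the same shape (the field values below are made up for illustration, not taken from the actual DISC-Law-SFT data):

[
  {
    "id": "identity_0",
    "conversations": [
      { "from": "user", "value": "What is the general limitation period for civil claims?" },
      { "from": "assistant", "value": "Under the Civil Code it is three years, counted from the day the right holder knew or should have known of the harm." }
    ]
  }
]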
Edit Qwen/finetune/finetune_lora_single_gpu.sh to point at the model and the data:
MODEL="/public/home/wmx_scnet/Qwen-1_8B-Chat"
DATA="/public/home/wmx_scnet/DISC-Law-SFT/train_data_law.json"
Then adjust the fine-tuning arguments in finetune_lora_single_gpu.sh:
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 512 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
Launch the fine-tuning run:
sh finetune/finetune_lora_single_gpu.sh
The initial estimate was about 7 hours: 310 iterations in total at roughly 76 s per iteration (310 × 76 s ≈ 6.5 h).
Screenshot from mid-run (hy-smi showed VRAM usage at 65%):
Once fine-tuning finished, the model needs to be merged.
The LoRA checkpoint is written to /public/home/wmx_scnet/Qwen/output_qwen/checkpoint-200.
Merging is done with the qwen_lora_merge.py script; the merged model lands in /public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge.
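The qwen_lora_merge.py script itself isn't shown in this post, but the Qwen repo documents the merge as loading the adapter with peft and folding it into the base weights, roughly like this (paths match the ones above):

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "/public/home/wmx_scnet/Qwen/output_qwen/checkpoint-200"
merged_dir = "/public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge"

# load base model + LoRA adapter, then fold the deltas into the base weights
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir, device_map="auto", trust_remote_code=True
).eval()
merged = model.merge_and_unload()
merged.save_pretrained(merged_dir, max_shard_size="2048MB", safe_serialization=True)

# save the tokenizer next to the weights so cli_demo.py can load the directory directly
tokenizer = AutoTokenizer.from_pretrained(adapter_dir, trust_remote_code=True)
tokenizer.save_pretrained(merged_dir)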
Run inference with the merged model:
python cli_demo.py -c /public/home/wmx_scnet/Qwen-1_8B-Chat_law_merge
Output:
I modified cli_demo.py to pass fp32=False:
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map=device_map,
fp32=False,
trust_remote_code=True,
resume_download=False,
).eval()
This was because training used bf16, and the model's config.json sets bf16=true. But with this change, inference produced garbled output; switching the model to fp32 made inference work normally.
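So the load call that finally worked on this card looks like the following; fp32/fp16/bf16 are Qwen-specific flags that from_pretrained forwards into the model config:

model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint_path,
    device_map=device_map,
    fp32=True,   # bf16 inference produced garbled output on the domestic card
    trust_remote_code=True,
).eval()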
The fine-tuned model answering questions from the legal data:
Summary:
Fine-tuning on the domestic card is very slow, taking about 7 hours; an NVIDIA 4070 Ti SUPER needs roughly 2.5 hours, a big gap.
4070 Ti SUPER fine-tuning arguments:
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 False \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 512 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
Screenshot of the 4070 Ti SUPER fine-tuning run below: speed was 7.32 s/it. (With a per-device batch size of 8 instead of 32, the run takes about four times as many steps, roughly 1240, and 1240 × 7.32 s ≈ 2.5 h, consistent with the total above.)
4070 Ti SUPER GPU memory usage (screenshot below): only 5.28 GB.
Inference on the 4070 Ti SUPER with the fine-tuned model: GPU memory usage is only 4.67 GB.
For comparison, the inference speed on the domestic card is shown in the screenshot below.
Compared with NVIDIA, the domestic card is already doing quite well; building this up from nothing is impressive, and the service is good: technical support is online 24 hours a day, and problems get resolved quickly.
Performance still needs to improve, as do the software and the ecosystem, and that depends on growing the user base. The compute is extremely cheap and well worth using day to day for fine-tuning, inference, and training LLMs. If you have specific hardware requirements, the site also offers NVIDIA-series cards at good prices.