Understanding the DeepSpeed Chat Project


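The wide adoption of ChatGPT has made large models popular and ushered in what some call the 2.0 era of deep-learning AI. DeepSpeed Chat is Microsoft's open-source tool for training large models. It builds on DeepSpeed's efficient training and can automate several kinds of large-model training.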

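Getting to the point: in this post I record the workflow and commands from my own experiments, for my later reference and as a starting point for other beginners. Project: DeepSpeed Chat (link). Resource hub: Hugging Face.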

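The project mentioned above is the official DeepSpeed Chat project. You can follow the procedure described in its README; the overall steps are reproduced below: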

pip install "deepspeed>=0.9.0"

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

Once the environment is installed as above, start training. The training commands are:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node

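These three commands are the one-shot way to run on a single GPU, a single node, and multiple nodes, respectively. I call it one-shot because DeepSpeed Chat fine-tunes a large model in three steps (supervised fine-tuning, reward model fine-tuning, and RLHF fine-tuning), following the InstructGPT paper; see: GPT Finetune in 3 steps.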
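As a small sketch of how the three one-shot commands relate, the snippet below maps each deployment type to the actor model used in the commands above and prints the resulting command line. The function names `actor_model_for` and `train_command` are made up for this illustration; the flags and model names are the ones already shown, and nothing is launched.

```shell
#!/bin/sh
# Map a deployment type to the actor model used in the commands above.
actor_model_for() {
    case "$1" in
        single_gpu)  echo "facebook/opt-1.3b" ;;
        single_node) echo "facebook/opt-13b" ;;
        multi_node)  echo "facebook/opt-66b" ;;
        *) echo "unknown deployment type: $1" >&2; return 1 ;;
    esac
}

# Build the one-shot train.py command line for a deployment type.
# The command is only printed; copy it into a terminal to launch it.
train_command() {
    actor=$(actor_model_for "$1") || return 1
    echo "python train.py --actor-model $actor --reward-model facebook/opt-350m --deployment-type $1"
}

train_command single_gpu
```

The reward model stays at facebook/opt-350m in all three cases; only the actor model grows with the available hardware.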
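With the one-shot train.py commands, problems are hard to localize when something goes wrong, so you can instead run the three steps one at a time. Each block below assumes you start from the DeepSpeed-Chat directory: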

Step 1:

# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/

# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh

# Evaluate the model
bash evaluation_scripts/run_prompt.sh

Step 2:

# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning

# Run the training script
bash training_scripts/single_gpu/run_350m.sh

# Evaluate the model
bash evaluation_scripts/run_eval.sh

Step 3:

# Move into the final step of the pipeline
cd training/step3_rlhf_finetuning/

# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh
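
The three training steps can also be chained so that a failure stops the run at the step that caused it. This is only a sketch: `run_step`, `run_pipeline`, and the `DRY_RUN` switch are hypothetical helpers, not part of DeepSpeed Chat, and it assumes it is started from the DeepSpeed-Chat directory. Each step runs in a subshell, so the working directory is back at the top level before the next step begins.

```shell
#!/bin/sh
# Run the three steps in order and stop at the first failure.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to execute for real.
DRY_RUN="${DRY_RUN:-1}"

# run_step DIR SCRIPT: run SCRIPT from inside DIR. The subshell keeps the
# caller's working directory unchanged for the next step.
run_step() {
    echo "==> $1: bash $2"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    ( cd "$1" && bash "$2" )
}

# The && chain means a later step only starts if the previous one succeeded.
run_pipeline() {
    run_step training/step1_supervised_finetuning training_scripts/single_gpu/run_1.3b.sh &&
    run_step training/step2_reward_model_finetuning training_scripts/single_gpu/run_350m.sh &&
    run_step training/step3_rlhf_finetuning training_scripts/single_gpu/run_1.3b.sh
}

run_pipeline || echo "pipeline stopped: the last '==>' line names the failing step" >&2
```

The evaluation scripts are left out here on purpose, since they are better run by hand after inspecting each step's output.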

Resources
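Even when following the steps above, you will probably still run into problems, because DeepSpeed Chat pulls its models and datasets from Hugging Face by default. Hugging Face is hosted outside China and is hard to reach from a mainland China IP, so errors such as ConnectionResetError are common. The workaround is to download the files on a machine that has access and then upload them to the training machine; the corresponding models can be found by searching on Hugging Face.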
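
Once the files have been uploaded, one way to keep the libraries from attempting any connection is to switch them to offline mode through environment variables. The sketch below uses the offline switches of huggingface_hub, transformers, and datasets; `LOCAL_HF_ROOT` is a hypothetical directory name chosen for this example. In offline mode the model and data arguments have to point at local paths or at files already in the local cache.

```shell
#!/bin/sh
# Hypothetical location where the manually downloaded files were uploaded.
LOCAL_HF_ROOT="${LOCAL_HF_ROOT:-./hf_local}"

# Tell the Hugging Face libraries to skip network lookups entirely.
export HF_HUB_OFFLINE=1          # huggingface_hub
export TRANSFORMERS_OFFLINE=1    # transformers
export HF_DATASETS_OFFLINE=1     # datasets

echo "offline mode on; expecting local files under $LOCAL_HF_ROOT"
```

Set these in the same shell session that later launches the training script, so the child processes inherit them.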

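Following the tutorial "Replacing the model and data in DeepSpeed Chat", I swapped in the model and data used by LLMZoo: the model is bigscience/bloomz-1b1, and the data is phoenix-sft-data-v1 or another dataset such as Dahoas/rm-static. Set the model_name_or_path parameter to bigscience/bloomz-1b1 and the data_path parameter to a local or remote data path.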
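
A minimal sketch of choosing between a local copy and the remote identifier for these two parameters follows. `LOCAL_HF_ROOT` and `resolve_repo` are hypothetical names for this example, and the directory layout `<root>/<org>/<name>` is an assumption. Whether a particular local dataset directory is accepted depends on the project's data-loading code, so treat the data path part as something to verify.

```shell
#!/bin/sh
# Hypothetical directory holding manually downloaded repos, laid out as
# $LOCAL_HF_ROOT/<org>/<name>.
LOCAL_HF_ROOT="${LOCAL_HF_ROOT:-./hf_local}"

# Print the local copy of a repo if it exists, otherwise the remote id.
resolve_repo() {
    if [ -d "$LOCAL_HF_ROOT/$1" ]; then
        echo "$LOCAL_HF_ROOT/$1"
    else
        echo "$1"
    fi
}

MODEL=$(resolve_repo bigscience/bloomz-1b1)
DATA=$(resolve_repo Dahoas/rm-static)

# The two parameters named in the text, ready to be placed on the launch
# command inside the step's run script.
echo "--model_name_or_path $MODEL --data_path $DATA"
```

Falling back to the remote identifier keeps the same script usable on machines that can reach Hugging Face directly.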

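Problem encountered: if the network connection fails, the resources have to be downloaded separately. Smaller files can be fetched with git clone together with git lfs (Git LFS is the Git extension that handles large files); larger ones, such as big models and big datasets, have to be pulled through a proxy. Downloading with git is shown in the figure: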
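
In text form, the git route looks roughly like the sketch below. `hf_clone_url` is a hypothetical helper; the URL layout (models at the site root, datasets under `/datasets/`) is how Hugging Face repositories are addressed. The commands are printed rather than executed, so they can be copied to a machine that can reach the site.

```shell
#!/bin/sh
# Build the clone URL for a Hugging Face repo.
# kind is "model" or "dataset"; datasets live under /datasets/.
hf_clone_url() {
    case "$1" in
        model)   echo "https://huggingface.co/$2" ;;
        dataset) echo "https://huggingface.co/datasets/$2" ;;
        *) echo "unknown kind: $1" >&2; return 1 ;;
    esac
}

# git lfs install enables the LFS hooks once per user account; after that
# a plain git clone also fetches the LFS-tracked weight files.
echo "git lfs install"
echo "git clone $(hf_clone_url model bigscience/bloomz-1b1)"
echo "git clone $(hf_clone_url dataset Dahoas/rm-static)"
```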

Figure: downloading resources with git
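For even larger data, download the files individually: they are listed in the repository's file view and can each be fetched on their own.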
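
A single file can be addressed directly through the repository's `resolve` path, which is handy for resumable downloads of large files. The sketch below only builds and prints the command. `hf_file_url` is a hypothetical helper, `config.json` is just an example file name, and the `main` revision is an assumption about the default branch. For a dataset, the repository part would be prefixed with `datasets/`.

```shell
#!/bin/sh
# Build the direct download URL for one file in a model repo.
# Usage: hf_file_url REPO_ID FILE [REVISION]; REVISION defaults to main.
hf_file_url() {
    echo "https://huggingface.co/$1/resolve/${3:-main}/$2"
}

# wget -c resumes a partially downloaded file instead of starting over.
echo "wget -c $(hf_file_url bigscience/bloomz-1b1 config.json)"
```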