One-Click RLHF Training with DeepSpeed Chat (Part 2): Practice


Part 1, One-Click RLHF Training with DeepSpeed Chat (Part 1): Theory, covered the background; this article shows how to actually run RLHF training with DeepSpeed Chat.

DeepSpeed Chat's RLHF training pipeline consists of three main stages:

  • Stage 1, supervised fine-tuning (SFT): fine-tune the pretrained language model on carefully selected human answers to various queries;
  • Stage 2, reward model fine-tuning: train a separate reward model (RW, usually smaller than the SFT model) on a dataset in which humans have scored multiple answers to the same query;
  • Stage 3, RLHF training: further fine-tune the SFT model with the Proximal Policy Optimization (PPO) algorithm, using reward feedback from the RW model.

Environment Setup

The base environment is configured as follows:

  • OS: Ubuntu 18.04
  • CPUs: a single node with 1 TB of RAM; 64 physical Intel CPUs, 16 cores each
  • GPUs: 8x A800 80GB
  • Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, after which Python is built and installed from source)
  • NVIDIA driver: 515.65.01 (choose the driver matching your GPU model)
  • CUDA toolkit: 11.7
  • NCCL: nccl_2.14.3-1+cuda11.7
  • cuDNN: 8.8.1.3_cuda11

Installing the NVIDIA driver, CUDA, Python, and the other tools above is not covered step by step here.

Create and activate the virtual environment deepspeedchat-venv-py310-cu117:

 

```bash
cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 deepspeedchat-venv-py310-cu117
source /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/activate
```

Install PyTorch offline by downloading the wheel matching your CUDA version:

 

```bash
pip install torch-1.13.1+cu117-cp310-cp310-linux_x86_64.whl
```

Install deepspeed, transformers, and the other dependencies:

 

```bash
pip install -r requirements.txt
```

The requirements.txt file:

 

```text
deepspeed==0.9.1
transformers==4.28.1
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
```

Preparing the Datasets, Models, and Code

Because the server cannot reach the public internet, the datasets and models are downloaded locally in advance.

For datasets, we use open-source datasets hosted on Hugging Face Datasets. Thanks to DeepSpeed RLHF's data abstraction and blending techniques, multiple data sources can be combined for training. However, different datasets may use different prompt conventions (for example, Dahoas/rm-static marks queries with "Human:" and answers with "Assistant:"), so users must align these prompts themselves. The DeepSpeed Chat examples always use the format from Dahoas/rm-static. Evaluation showed that blending different datasets improves model quality.
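For illustration only (this is not DeepSpeed Chat code), a minimal sketch of aligning a hypothetical dataset with instruction/output fields to the Dahoas/rm-static prompt style:

```python
def to_rm_static_format(example: dict) -> dict:
    """Map a record with hypothetical 'instruction'/'output' fields onto the
    " Human: ... Assistant: ..." layout used by Dahoas/rm-static."""
    return {
        "prompt": f" Human: {example['instruction']} Assistant:",
        "chosen": f" {example['output']}",
    }

# Toy usage:
print(to_rm_static_format({"instruction": "Name a primary color.",
                           "output": "Red."}))
# {'prompt': ' Human: Name a primary color. Assistant:', 'chosen': ' Red.'}
```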

Download the datasets:

 

```bash
git clone https://huggingface.co/datasets/Dahoas/rm-static
git clone https://huggingface.co/datasets/Dahoas/full-hh-rlhf
git clone https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise
git clone https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets
```

Format of the Dahoas/rm-static dataset:

prompt (string): " Human: I am trying to write a fairy tale. What is the most popular plot? Assistant: The most popular plot might be “a princess goes to a faraway land, falls in love, and has a magic genie that grants her wishes”. We can find plenty of examples of this if we search for fairy tales using the search engine Google. Human: You can't look anything up on google. Assistant: OK, so let’s think about some other popular fairy tales. Human: Do you like the plot of Hansel and Gretel? ... Assistant:"
response (string): " This sounds like a really interesting modern retelling of the story!"
chosen (string): " This sounds like a really interesting modern retelling of the story!"
rejected (string): " And the prince and the princess both decide that they are more powerful together than apart?"

Format of the Dahoas/full-hh-rlhf dataset:

prompt (string): " Human: Should you buy a case to protect your cell phone? Assistant: It depends on your circumstances. If you carry your phone in a pocket or a purse then you probably want a case. But if you only need a phone for quick interactions, a case may actually cause more harm than good. What do you need the phone for? Are you a parent, or do you work from home? Human: What harm could it do? ... Assistant:"
response (string): " You’re welcome."
chosen (string): " You’re welcome."
rejected (string): " It sounds like you’ve got the basics down. Any further questions or concerns? You can send me any feedback you have at help@babba.ai."

Format of the Dahoas/synthetic-instruct-gptj-pairwise dataset:

prompt (string): "I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs."
chosen (string): "Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients."
rejected (string): "How do I store a bagels for eating at a later date? You can place the bagels in an airtight container and reheat them in the microwave. Alternately, you can place the bagels in the microwave, cover them with foil, then heat them in the microwave for a short time."

Format of the yitingxie/rlhf-reward-datasets dataset:

prompt (string): " Human: Do you know why turkeys became the official food of thanksgiving? "
chosen (string): "Assistant: To be honest, I don’t know anything about that. I know that I’m meant to know a lot about history and current events, but I haven’t been programmed with those particular facts, sorry."
rejected (string): "Assistant: I know that the American Indians were the first to celebrate the first thanksgiving, and the first European settlers came to New England in the early 1600s, and it is likely that they celebrated the first thanksgiving in the late 1600s. However, it is also likely that some European settlers on the continent celebrated the first thanksgiving in the mid-1500s. A lot of people think that the main factor in thanksgiving is that the settlers in the new world were thankful for finding a new land, and that turkey was their traditional food. Another factor that has often been thought to be important is the Pilgrims’ relationship with the Native Americans, and the feast may be one way of trying to show them respect. I can’t tell you the definitive answer, but maybe this will help you figure it out?"

Since GPT-3 has no open-source checkpoint, we use pretrained models from Meta's OPT family (e.g., facebook/opt-1.3b); other pretrained models (e.g., GPT-Neo, BLOOM) work as well.

This article uses opt-2.7b to train the actor model and opt-350m to train the reward model. Download them:

 

```bash
git clone https://huggingface.co/facebook/opt-350m
git clone https://huggingface.co/facebook/opt-2.7b
```

Edit the config.json file in the opt-350m directory and change _name_or_path to the local model path:

 

```json
{
  "_name_or_path": "/home/guodong.li/model/hf-opt-350m",
  ...
}
```

Similarly, edit the config.json file in the opt-2.7b directory and set _name_or_path to the local model path.
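For convenience, a small sketch that patches both configs in one go (the directories are this article's local paths; adjust them to yours):

```python
import json

# Point each local checkpoint's config at its own directory.
for model_dir in ["/home/guodong.li/model/hf-opt-350m",
                  "/home/guodong.li/model/hf-opt-2.7b"]:
    config_file = f"{model_dir}/config.json"
    with open(config_file) as f:
        config = json.load(f)
    config["_name_or_path"] = model_dir  # replace the Hub name with the local path
    with open(config_file, "w") as f:
        json.dump(config, f, indent=2)
```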

RLHF Training

Download the DeepSpeedExamples code and enter the DeepSpeed Chat directory:

 

```bash
# commit id: 9a586b1
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
```

The code layout:

 

```bash
> tree
.
|____training                                  # training
| |____utils                                   # utilities
| | |____utils.py
| | |____model                                 # model utilities
| | | |____reward_model.py
| | | |____model_utils.py
| | |____module
| | | |____lora.py
| | |____ds_utils.py
| | |____data                                  # data-processing utilities
| | | |____data_utils.py
| | | |____raw_datasets.py
| |____step1_supervised_finetuning             # stage 1: supervised fine-tuning
| | |____training_log_output
| | | |____opt-1.3b-globalBatchSize128.log
| | |____main.py
| | |____training_scripts                      # training scripts
| | | |____other_language
| | | | |____run_chinese.sh                    # SFT based on BLOOM
| | | | |____run_japanese.sh                   # SFT based on mGPT
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____evaluation_scripts
| | | |____run_prompt.sh
| | |____README.md
| | |____prompt_eval.py
| |____step2_reward_model_finetuning           # stage 2: reward model fine-tuning
| | |____rw_eval.py
| | |____training_log_output
| | | |____opt-350m_globalBatchSize-64.log
| | |____main.py
| | |____training_scripts                      # training scripts
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_350m.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_350m.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_350m.sh
| | |____evaluation_scripts                    # evaluation scripts
| | | |____run_eval.sh
| | |____README.md
| |____README.md
| |____step3_rlhf_finetuning                   # stage 3: RLHF fine-tuning
| | |____ppo_trainer.py
| | |____training_log_output
| | | |____actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log
| | |____main.py
| | |____BenckmarkSetting.md
| | |____training_scripts                      # training scripts
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____rlhf_engine.py
| | |____README.md
|____train.py                                  # training entry point
|____chat.py
|____README.md
|____requirements.txt
|____inference                                 # inference
| |____chatbot.py
```

Edit training/utils/data/raw_datasets.py so the datasets load from local paths.
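The change is essentially one line per dataset class, as in this sketch (the local clone path is an assumption; adjust it to where you cloned the datasets):

```python
from datasets import load_dataset

# Original (downloads from the Hugging Face Hub):
# raw_datasets = load_dataset("Dahoas/rm-static")
# Load the local clone instead:
raw_datasets = load_dataset("/home/guodong.li/data/rm-static")
print(raw_datasets["train"][0].keys())
# dict_keys(['prompt', 'response', 'chosen', 'rejected'])
```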

 

Stage 1: Supervised Fine-Tuning (SFT)

Supervised fine-tuning (SFT) closely resembles standard language-model fine-tuning on causal language tasks (e.g., WikiText-103). The main difference is the data source: SFT fine-tunes the model on high-quality query-answer pairs so that its generations match human preferences.
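Conceptually, each SFT example is the prompt and the chosen answer concatenated into one sequence and trained with the ordinary causal-LM loss. A minimal sketch (illustrative, not the DeepSpeed Chat code; the model path is this article's local opt-2.7b):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/guodong.li/model/hf-opt-2.7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# One SFT sample: query and answer concatenated in the rm-static format.
prompt = " Human: What is the capital of France? Assistant:"
answer = " The capital of France is Paris."
batch = tokenizer(prompt + answer, return_tensors="pt")

# Standard causal-LM objective: predict each next token (labels = input_ids).
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```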

DeepSpeed Chat provides multiple training scripts for a single GPU (e.g., one A6000-48G, V100-32G, or A100-40G), a single node (e.g., 8/16x V100-32G, 8x A100-40G/80G), and multiple nodes (e.g., 64x A100-80G); they can be found in the training_scripts directory.

Here I run supervised fine-tuning on a single node with multiple GPUs. I modify the opt-13b training script, but the model actually fine-tuned is opt-2.7b.

Modify the SFT training script training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh:

 

```bash
#!/bin/bash
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

deepspeed main.py \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
   --data_split 2,4,4 \
   --model_name_or_path /home/guodong.li/model/hf-opt-2.7b \
   --per_device_train_batch_size 128 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 512 \
   --learning_rate 1e-4 \
   --weight_decay 0. \
   --num_train_epochs 6 \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```

Run:

 

```bash
# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat 1
```

Follow training through the training.log file, for example with tail -n100 -f training.log.

 


Model weight files:

 

```bash
ls -al --block-size=M
total 5419M
drwxrwxr-x  2 guodong.li guodong.li    1M May  1 13:01 .
drwxrwxr-x 10 guodong.li guodong.li    1M May  1 09:16 ..
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 config.json
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 merges.txt
-rw-rw-r--  1 guodong.li guodong.li 5418M May  1 12:18 pytorch_model.bin
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 training.log
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 vocab.json
```

After training completes, the next step is to evaluate the supervised fine-tuned model.

Run:

```bash
cd applications/DeepSpeed-Chat/training/step1_supervised_finetuning
sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
```

It requires the paths of two models:

  • the original pretrained model (i.e., --model_name_or_path_baseline facebook/opt-1.3b)
  • the fine-tuned model (i.e., --model_name_or_path_finetune output/check_base)

The prompt_eval.py evaluation script includes several prompts that you can freely change to your liking.

Run output:

 

```text
> sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
load_hf_tokenizer model_name_or_path: /home/guodong.li/model/hf-opt-2.7b
==========Baseline: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant:
==========finetune: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: I'm not sure what you mean by that.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I don't know, I'm not a scientist. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but
==========finetune: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I can't.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What's a wise frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog
==========finetune: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What kind of frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of
====================prompt end=============================
==========Baseline: Greedy=========
Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human:
==========finetune: Greedy=========
Human: Who was president of the United States in 1955?
Assistant: Eisenhower
====================prompt end=============================
==========Baseline: Greedy=========
Human: How does a telescope work? Assistant: It's a big mirror. I'm not sure if you're joking or not, but I'm going to assume you're joking. I'm not joking. I'm a physics major. I'm a physics major too. I'm just not sure if you're joking or not. I'm not joking. I'm a physics major. I'm a physics major too. I'm just not sure if you're joking or not. I'm a physics major too.
==========finetune: Greedy=========
Human: How does a telescope work? Assistant: It's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope,
====================prompt end=============================
==========Baseline: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: Because they're stupid. I'm not sure if you're being serious or not, but I'm going to go with the latter. I'm serious. I've heard it from a few people.
==========finetune: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: To get away from the cold.
====================prompt end=============================
```

Stage 2: Reward Model Fine-Tuning

Reward model (RM) fine-tuning is similar to the stage-1 supervised fine-tuning (SFT), but there are several key differences between RM and SFT fine-tuning:

  • Training data: for SFT fine-tuning, each sample is a query and an answer concatenated together. For RM fine-tuning, each batch entry consists of two query-answer pairs, i.e. the same query paired with a high-scoring answer and with a low-scoring answer. This leads to the second difference below.

  • Training objective: for the RM, the training objective is a pairwise ranking score; given the two query-answer pairs, the RM should give the better answer a higher score. There are several ways to achieve this. The DeepSpeed Chat implementation takes the score at the sequence's end-of-text token or first padding token as the aggregate score and compares those; using the average score over the whole answer is an alternative. (A sketch of this loss follows the list.)

  • The --num_padding_at_beginning argument: the RM fine-tuning script exposes this interesting argument because different models can have different padding or tokenizer behavior. Specifically, the tokenizer of the OPT model family always adds a padding token at the beginning of the sequence, which affects which token we pick as the scoring token, so this must be accounted for.

  • RM evaluation: an evaluation script, rw_eval.py, lets users run simple prompt-answer tests.
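A minimal sketch of the pairwise ranking loss itself (illustrative; the DeepSpeed Chat implementation additionally handles which token's score to aggregate, as described above):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """chosen_scores / rejected_scores: shape (batch,), the RM's scalar
    scores for the high- and low-scoring answer to the same query."""
    # Train the RM so that chosen > rejected: -log(sigmoid(r_c - r_r)).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with a batch of three query-answer pairs:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(pairwise_ranking_loss(chosen, rejected))  # small when chosen >> rejected
```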

Here I fine-tune the reward model from opt-350m on a single node with multiple GPUs. You can also train a larger reward model by simply swapping in your preferred candidate model and enabling the efficient-training options described for SFT fine-tuning.

Next, modify the reward-model fine-tuning script training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh.

 


Run:

 

```bash
# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning
sh training_scripts/single_node/run_350m.sh /home/guodong.li/output/dschat-reward 2
```

Follow training through the training.log file, for example with tail -n100 -f training.log. The model weight files:

```bash
> ls -al --block-size=M
total 634M
drwxrwxr-x  2 guodong.li guodong.li   1M May  1 14:26 .
drwxrwxr-x 11 guodong.li guodong.li   1M May  1 13:27 ..
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 config.json
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 merges.txt
-rw-rw-r--  1 guodong.li guodong.li 632M May  1 14:21 pytorch_model.bin
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:22 training.log
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 vocab.json
```

Next, evaluate the fine-tuned reward model.

Run:

 

```bash
sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
```

Run output:

 

```text
> sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
load_hf_tokenizer model_name_or_path: /home/guodong.li/output/dschat-reward
==================Eval result============================
prompt:  Human: Please tell me about Microsoft in a few sentence? Assistant:
good_ans: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975
bad_ans: I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure.
=============Scores (higher, better)========================
good_ans score: 9.383882522583008
bad_ans score: -3.2731785774230957
==================Eval result============================
prompt:  Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:
good_ans: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another
bad_ans: I don't know, I don't know.
=============Scores (higher, better)========================
good_ans score: 9.291404724121094
bad_ans score: -0.04333972930908203
```

Stage 3: RLHF Training

This is the most complex step of the entire InstructGPT pipeline; DeepSpeed Chat's Hybrid Engine provides enough acceleration here to avoid a heavy training-time (and cost) impact.

With the fine-tuned actor and reward model checkpoints from the previous two steps in hand, you only need to run the following script to start PPO training.

DeepSpeed Chat provides multiple actor training scripts in the training_scripts folder, all of which use an OPT-350m reward model. You can, however, experiment with different reward model sizes.

Here I run RLHF training on a single node with multiple GPUs, with OPT-2.7b as the actor model and OPT-350m as the reward model, again by modifying the opt-13b training script.
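It helps to see what PPO optimizes here: the per-token reward is not the raw RM score alone; a KL penalty against the frozen SFT reference model keeps the actor from drifting too far. A minimal sketch of that reward shaping (illustrative only, not the ppo_trainer.py code; kl_coef is a hypothetical coefficient):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   actor_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) scalar RM score per generated answer.
    actor_logprobs / ref_logprobs: (batch, seq) per-token log-probs under
    the actor and under the frozen SFT reference model."""
    # Per-token KL penalty discourages drifting away from the SFT model.
    rewards = -kl_coef * (actor_logprobs - ref_logprobs)
    # The RM score is credited at the final generated token.
    rewards[:, -1] += rm_score
    return rewards  # fed into PPO's advantage estimation

# Toy usage: batch of 2 answers, 5 generated tokens each.
r = shaped_rewards(torch.tensor([1.5, -0.3]),
                   torch.randn(2, 5), torch.randn(2, 5))
print(r.shape)  # torch.Size([2, 5])
```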

Modify the RLHF training script training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh.

Run:

 

```bash
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat /home/guodong.li/output/dschat-reward 3 3 /home/guodong.li/output/dschat-ppo
```

Follow training through the training.log file, for example with tail -n100 -f training.log:

```text
|cri_loss: 0.0068149566650390625|unsuper_loss: 0.0
average reward score: -4.80078125
-------------------------------------------------------------------------------------
|E2E latency=32.81s |Gather latency=2.58s (7.87%) |Generate time=10.50s (31.99%) |Training time=15.92s (48.52%) |Others=6.39 (19.49%)|CurSamplesPerSec=7.80 |AvgSamplesPerSec=7.49
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 110|ppo_ep: 1|act_loss: 0.0003905296325683594|cri_loss: 0.00641632080078125|unsuper_loss: 0.0
...
-------------------------------------------------------------------------------------
|E2E latency=33.83s |Gather latency=3.25s (9.60%) |Generate time=9.96s (29.45%) |Training time=17.73s (52.40%) |Others=6.14 (18.15%)|CurSamplesPerSec=7.57 |AvgSamplesPerSec=7.49
epoch: 0|step: 119|ppo_ep: 1|act_loss: 0.00606536865234375|cri_loss: 0.0023479461669921875|unsuper_loss: 0.0
average reward score: -4.91796875
-------------------------------------------------------------------------------------
saving model ...
...
saving model ...
[2023-05-01 16:54:46,717] [INFO] [launch.py:460:main] Process 37162 exits successfully.
...
[2023-05-01 16:54:49,720] [INFO] [launch.py:460:main] Process 37158 exits successfully.
```

Model weight output files:

 

```bash
tree
.
├── actor
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   └── vocab.json
├── critic
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   └── vocab.json
└── training.log
########################################
> ls -al --block-size=M actor/ critic/
actor/:
total 5059M
drwxrwxr-x 2 guodong.li guodong.li    1M May  1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li    1M May  1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 5058M May  1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 vocab.json

critic/:
total 634M
drwxrwxr-x 2 guodong.li guodong.li   1M May  1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li   1M May  1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 632M May  1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 vocab.json
```

One-Click RLHF Training

DeepSpeed Chat also provides a single script that completes all three steps of RLHF training and produces your ChatGPT-like model.

Run:

 

```bash
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
```

Arguments:

  • --deployment-type: deployment type; supports single GPU (single_gpu), single node with multiple GPUs (single_node), and multiple nodes (multi_node)
  • --actor-model: the actor model
  • --reward-model: the reward model
  • --output-dir: output path for the model weights
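Conceptually, train.py is an orchestrator that launches the three stages in order. A toy sketch of the idea (a rough approximation with assumed script choices, not the actual train.py logic, which builds each stage's command from the CLI arguments):

```python
import subprocess

# Run each stage's launcher in order; abort the pipeline on failure.
stages = [
    ("training/step1_supervised_finetuning",
     ["bash", "training_scripts/single_node/run_1.3b.sh"]),
    ("training/step2_reward_model_finetuning",
     ["bash", "training_scripts/single_node/run_350m.sh"]),
    ("training/step3_rlhf_finetuning",
     ["bash", "training_scripts/single_node/run_1.3b.sh"]),
]
for workdir, cmd in stages:
    subprocess.run(cmd, cwd=workdir, check=True)
```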


Model weight output files:

 

```bash
tree ds-pipeline/
ds-pipeline/
├── actor-models
│   └── 13b
│       ├── config.json
│       ├── merges.txt
│       ├── pytorch_model.bin
│       ├── training.log
│       └── vocab.json
├── reward-models
│   └── 350m
│       ├── config.json
│       ├── merges.txt
│       ├── pytorch_model.bin
│       ├── training.log
│       └── vocab.json
└── step3-models
    └── 13b
        ├── actor
        │   ├── config.json
        │   ├── merges.txt
        │   ├── pytorch_model.bin
        │   └── vocab.json
        ├── critic
        │   ├── config.json
        │   ├── merges.txt
        │   ├── pytorch_model.bin
        │   └── vocab.json
        └── training.log
```

Model Serving (Inference)

To quickly test the final model trained by DeepSpeed-Chat, a simple serving script is provided.

 

```bash
# serve the final model
python chat.py --path ${PATH-to-your-actor-model}
```
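Under the hood this is essentially a REPL around model.generate. A minimal stand-in (illustrative, not the actual chat.py) that loads the trained actor and chats in the same Human/Assistant format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/home/guodong.li/output/dschat-ppo/actor"  # this article's actor path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

history = ""
while True:
    user = input("Enter input (type 'quit' to exit): ")
    if user == "quit":
        break
    history += f" Human: {user} Assistant:"
    ids = tokenizer(history, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=128)
    # Decode only the newly generated tokens and append them to the history.
    reply = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    history += reply
    print("Assistant:", reply)
```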

Run output:

 

```text
> python chat.py --path /home/guodong.li/output/dschat-ppo/actor
Enter input (type 'quit' to exit, 'clear' to clean memory): Do you know Microsoft?
------------------------------ Round 1 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you explian it to a 6-year old child?
------------------------------ Round 2 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
 Human: Can you explian it to a 6-year old child?
 Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): who are you?
------------------------------ Round 3 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
 Human: Can you explian it to a 6-year old child?
 Assistant: Microsoft is a software company.</s>
 Human: who are you?
 Assistant: Microsoft is a software company.</s></s>
Enter input (type 'quit' to exit, 'clear' to clean memory):
```

To build personal assistants, chatbots, or other LLM applications on top of a model trained with DeepSpeed Chat, see LangChain.

Conclusion

This article walked through RLHF training with DeepSpeed Chat on a single multi-GPU node using OPT models. I hope it proves useful.

References

  • DeepSpeed Chat: one-click RLHF training that makes your ChatGPT-like 100-billion-parameter models 15x faster and cheaper
  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
  • Step 1: Supervised Fine-Tuning (SFT)
  • Step 2: Reward Model Fine-Tuning
  • Step 3: Reinforcement Learning from Human Feedback (RLHF)
  • DeepSpeed Chat training details
