One-Click RLHF Training with DeepSpeed Chat (Part 2): Practice


Part 1, One-Click RLHF Training with DeepSpeed Chat (Part 1): Theory, covered the background; this article shows how to actually run RLHF training with DeepSpeed Chat.

DeepSpeed Chat's RLHF training pipeline consists of three main stages:

  • Stage 1, supervised fine-tuning (SFT): fine-tune the pretrained language model on carefully selected human answers to various queries;
  • Stage 2, reward model fine-tuning: train a separate reward model (RW, usually smaller than the SFT model) on a dataset in which humans have scored multiple answers to the same query;
  • Stage 3, RLHF training: further fine-tune the SFT model with the Proximal Policy Optimization (PPO) algorithm, using reward feedback from the RW model.

Environment Setup

The base environment is configured as follows:

  • OS: Ubuntu 18.04
  • CPUs: a single node with 1 TB of RAM; 64 physical Intel CPUs, 16 cores each
  • GPUs: 8x A800 80GB
  • Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, after which Python is built and installed from source)
  • NVIDIA driver: 515.65.01 (choose the driver matching your GPU model)
  • CUDA toolkit: 11.7
  • NCCL: nccl_2.14.3-1+cuda11.7
  • cuDNN: 8.8.1.3_cuda11

Installing the NVIDIA driver, CUDA, Python, and the other tools above is not covered step by step here.

Create and activate the virtual environment deepspeedchat-venv-py310-cu117:

 

```bash
cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 deepspeedchat-venv-py310-cu117
source /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/activate
```

Install PyTorch offline by downloading the wheel matching your CUDA version:

 

```bash
pip install torch-1.13.1+cu117-cp310-cp310-linux_x86_64.whl
```

Install deepspeed, transformers, and the other dependencies:

 

```bash
pip install -r requirements.txt
```

The requirements.txt file:

 

```text
deepspeed==0.9.1
transformers==4.28.1
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
```

Preparing the Datasets, Models, and Code

Because the server cannot reach the public internet, the datasets and models are downloaded locally in advance.

For datasets, we use open-source datasets hosted on Hugging Face Datasets. Thanks to DeepSpeed RLHF's data abstraction and blending techniques, multiple data sources can be combined for training. However, different datasets may use different prompt conventions (for example, Dahoas/rm-static marks queries with "Human:" and answers with "Assistant:"), so users must align these prompts themselves. The DeepSpeed Chat examples always use the format from Dahoas/rm-static. Evaluation showed that blending different datasets improves model quality.
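For illustration only (this is not DeepSpeed Chat code), a minimal sketch of aligning a hypothetical dataset with instruction/output fields to the Dahoas/rm-static prompt style:

```python
def to_rm_static_format(example: dict) -> dict:
    """Map a record with hypothetical 'instruction'/'output' fields onto the
    " Human: ... Assistant: ..." layout used by Dahoas/rm-static."""
    return {
        "prompt": f" Human: {example['instruction']} Assistant:",
        "chosen": f" {example['output']}",
    }

# Toy usage:
print(to_rm_static_format({"instruction": "Name a primary color.",
                           "output": "Red."}))
# {'prompt': ' Human: Name a primary color. Assistant:', 'chosen': ' Red.'}
```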

Download the datasets:

 

```bash
git clone https://huggingface.co/datasets/Dahoas/rm-static
git clone https://huggingface.co/datasets/Dahoas/full-hh-rlhf
git clone https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise
git clone https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets
```

Format of the Dahoas/rm-static dataset:

prompt (string): " Human: I am trying to write a fairy tale. What is the most popular plot? Assistant: The most popular plot might be “a princess goes to a faraway land, falls in love, and has a magic genie that grants her wishes”. We can find plenty of examples of this if we search for fairy tales using the search engine Google. Human: You can't look anything up on google. Assistant: OK, so let’s think about some other popular fairy tales. Human: Do you like the plot of Hansel and Gretel? ... Assistant:"
response (string): " This sounds like a really interesting modern retelling of the story!"
chosen (string): " This sounds like a really interesting modern retelling of the story!"
rejected (string): " And the prince and the princess both decide that they are more powerful together than apart?"

Format of the Dahoas/full-hh-rlhf dataset:

prompt (string): " Human: Should you buy a case to protect your cell phone? Assistant: It depends on your circumstances. If you carry your phone in a pocket or a purse then you probably want a case. But if you only need a phone for quick interactions, a case may actually cause more harm than good. What do you need the phone for? Are you a parent, or do you work from home? Human: What harm could it do? ... Assistant:"
response (string): " You’re welcome."
chosen (string): " You’re welcome."
rejected (string): " It sounds like you’ve got the basics down. Any further questions or concerns? You can send me any feedback you have at help@babba.ai."

Format of the Dahoas/synthetic-instruct-gptj-pairwise dataset:

prompt (string): "I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs."
chosen (string): "Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients."
rejected (string): "How do I store a bagels for eating at a later date? You can place the bagels in an airtight container and reheat them in the microwave. Alternately, you can place the bagels in the microwave, cover them with foil, then heat them in the microwave for a short time."

Format of the yitingxie/rlhf-reward-datasets dataset:

prompt (string): " Human: Do you know why turkeys became the official food of thanksgiving? "
chosen (string): "Assistant: To be honest, I don’t know anything about that. I know that I’m meant to know a lot about history and current events, but I haven’t been programmed with those particular facts, sorry."
rejected (string): "Assistant: I know that the American Indians were the first to celebrate the first thanksgiving, and the first European settlers came to New England in the early 1600s, and it is likely that they celebrated the first thanksgiving in the late 1600s. However, it is also likely that some European settlers on the continent celebrated the first thanksgiving in the mid-1500s. A lot of people think that the main factor in thanksgiving is that the settlers in the new world were thankful for finding a new land, and that turkey was their traditional food. Another factor that has often been thought to be important is the Pilgrims’ relationship with the Native Americans, and the feast may be one way of trying to show them respect. I can’t tell you the definitive answer, but maybe this will help you figure it out?"

Since GPT-3 has no open-source checkpoint, we use pretrained models from Meta's OPT family (e.g., facebook/opt-1.3b); other pretrained models (e.g., GPT-Neo, BLOOM) work as well.

This article uses opt-2.7b to train the actor model and opt-350m to train the reward model. Download them:

 

```bash
git clone https://huggingface.co/facebook/opt-350m
git clone https://huggingface.co/facebook/opt-2.7b
```

Edit the config.json file in the opt-350m directory and change _name_or_path to the local model path:

 

```json
{
  "_name_or_path": "/home/guodong.li/model/hf-opt-350m",
  ...
}
```

Similarly, edit the config.json file in the opt-2.7b directory and set _name_or_path to the local model path.
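For convenience, a small sketch that patches both configs in one go (the directories are this article's local paths; adjust them to yours):

```python
import json

# Point each local checkpoint's config at its own directory.
for model_dir in ["/home/guodong.li/model/hf-opt-350m",
                  "/home/guodong.li/model/hf-opt-2.7b"]:
    config_file = f"{model_dir}/config.json"
    with open(config_file) as f:
        config = json.load(f)
    config["_name_or_path"] = model_dir  # replace the Hub name with the local path
    with open(config_file, "w") as f:
        json.dump(config, f, indent=2)
```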

RLHF Training

Download the DeepSpeedExamples code and enter the DeepSpeed Chat directory:

 

```bash
# commit id: 9a586b1
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
```

The code layout:

 

```bash
> tree
.
|____training                                  # training
| |____utils                                   # utilities
| | |____utils.py
| | |____model                                 # model utilities
| | | |____reward_model.py
| | | |____model_utils.py
| | |____module
| | | |____lora.py
| | |____ds_utils.py
| | |____data                                  # data-processing utilities
| | | |____data_utils.py
| | | |____raw_datasets.py
| |____step1_supervised_finetuning             # stage 1: supervised fine-tuning
| | |____training_log_output
| | | |____opt-1.3b-globalBatchSize128.log
| | |____main.py
| | |____training_scripts                      # training scripts
| | | |____other_language
| | | | |____run_chinese.sh                    # SFT based on BLOOM
| | | | |____run_japanese.sh                   # SFT based on mGPT
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____evaluation_scripts
| | | |____run_prompt.sh
| | |____README.md
| | |____prompt_eval.py
| |____step2_reward_model_finetuning           # stage 2: reward model fine-tuning
| | |____rw_eval.py
| | |____training_log_output
| | | |____opt-350m_globalBatchSize-64.log
| | |____main.py
| | |____training_scripts                      # training scripts
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_350m.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_350m.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_350m.sh
| | |____evaluation_scripts                    # evaluation scripts
| | | |____run_eval.sh
| | |____README.md
| |____README.md
| |____step3_rlhf_finetuning                   # stage 3: RLHF fine-tuning
| | |____ppo_trainer.py
| | |____training_log_output
| | | |____actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log
| | |____main.py
| | |____BenckmarkSetting.md
| | |____training_scripts                      # training scripts
| | | |____multi_node                          # multi-node multi-GPU scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node                         # single-node multi-GPU scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu                          # single-GPU scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____rlhf_engine.py
| | |____README.md
|____train.py                                  # training entry point
|____chat.py
|____README.md
|____requirements.txt
|____inference                                 # inference
| |____chatbot.py
```

Edit training/utils/data/raw_datasets.py so the datasets load from local paths.
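The change is essentially one line per dataset class, as in this sketch (the local clone path is an assumption; adjust it to where you cloned the datasets):

```python
from datasets import load_dataset

# Original (downloads from the Hugging Face Hub):
# raw_datasets = load_dataset("Dahoas/rm-static")
# Load the local clone instead:
raw_datasets = load_dataset("/home/guodong.li/data/rm-static")
print(raw_datasets["train"][0].keys())
# dict_keys(['prompt', 'response', 'chosen', 'rejected'])
```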

 

Stage 1: Supervised Fine-Tuning (SFT)

Supervised fine-tuning (SFT) closely resembles standard language-model fine-tuning on causal language tasks (e.g., WikiText-103). The main difference is the data source: SFT fine-tunes the model on high-quality query-answer pairs so that its generations match human preferences.
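Conceptually, each SFT example is the prompt and the chosen answer concatenated into one sequence and trained with the ordinary causal-LM loss. A minimal sketch (illustrative, not the DeepSpeed Chat code; the model path is this article's local opt-2.7b):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/guodong.li/model/hf-opt-2.7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# One SFT sample: query and answer concatenated in the rm-static format.
prompt = " Human: What is the capital of France? Assistant:"
answer = " The capital of France is Paris."
batch = tokenizer(prompt + answer, return_tensors="pt")

# Standard causal-LM objective: predict each next token (labels = input_ids).
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```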

DeepSpeed Chat provides multiple training scripts for a single GPU (e.g., one A6000-48G, V100-32G, or A100-40G), a single node (e.g., 8/16x V100-32G, 8x A100-40G/80G), and multiple nodes (e.g., 64x A100-80G); they can be found in the training_scripts directory.

Here I run supervised fine-tuning on a single node with multiple GPUs. I modify the opt-13b training script, but the model actually fine-tuned is opt-2.7b.

Modify the SFT training script training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh:

 

```bash
#!/bin/bash
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

deepspeed main.py \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
   --data_split 2,4,4 \
   --model_name_or_path /home/guodong.li/model/hf-opt-2.7b \
   --per_device_train_batch_size 128 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 512 \
   --learning_rate 1e-4 \
   --weight_decay 0. \
   --num_train_epochs 6 \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```

Run:

 

```bash
# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat 1
```

Follow training through the training.log file, for example with tail -n100 -f training.log.

 


Model weight files:

 

```bash
ls -al --block-size=M
total 5419M
drwxrwxr-x  2 guodong.li guodong.li    1M May  1 13:01 .
drwxrwxr-x 10 guodong.li guodong.li    1M May  1 09:16 ..
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 config.json
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 merges.txt
-rw-rw-r--  1 guodong.li guodong.li 5418M May  1 12:18 pytorch_model.bin
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 training.log
-rw-rw-r--  1 guodong.li guodong.li    1M May  1 12:18 vocab.json
```

After training completes, the next step is to evaluate the supervised fine-tuned model.

Run:

```bash
cd applications/DeepSpeed-Chat/training/step1_supervised_finetuning
sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
```

It requires the paths of two models:

  • the original pretrained model (i.e., --model_name_or_path_baseline facebook/opt-1.3b)
  • the fine-tuned model (i.e., --model_name_or_path_finetune output/check_base)

The prompt_eval.py evaluation script includes several prompts that you can freely change to your liking.

Run output:

 

```text
> sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
load_hf_tokenizer model_name_or_path: /home/guodong.li/model/hf-opt-2.7b
==========Baseline: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant:
==========finetune: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: I'm not sure what you mean by that.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I don't know, I'm not a scientist. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but I can tell you that the moon landing was faked. I'm not a scientist either, but
==========finetune: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I can't.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What's a wise frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog
==========finetune: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What kind of frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of
====================prompt end=============================
==========Baseline: Greedy=========
Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human:
==========finetune: Greedy=========
Human: Who was president of the United States in 1955?
Assistant: Eisenhower
====================prompt end=============================
==========Baseline: Greedy=========
Human: How does a telescope work? Assistant: It's a big mirror. I'm not sure if you're joking or not, but I'm going to assume you're joking. I'm not joking. I'm a physics major. I'm a physics major too. I'm just not sure if you're joking or not. I'm not joking. I'm a physics major. I'm a physics major too. I'm just not sure if you're joking or not. I'm a physics major too.
==========finetune: Greedy=========
Human: How does a telescope work? Assistant: It's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope,
====================prompt end=============================
==========Baseline: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: Because they're stupid. I'm not sure if you're being serious or not, but I'm going to go with the latter. I'm serious. I've heard it from a few people.
==========finetune: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: To get away from the cold.
====================prompt end=============================
```

Stage 2: Reward Model Fine-Tuning

Reward model (RM) fine-tuning is similar to the stage-1 supervised fine-tuning (SFT), but there are several key differences between RM and SFT fine-tuning:

  • Training data: for SFT fine-tuning, each sample is a query and an answer concatenated together. For RM fine-tuning, each batch entry consists of two query-answer pairs, i.e. the same query paired with a high-scoring answer and with a low-scoring answer. This leads to the second difference below.

  • Training objective: for the RM, the training objective is a pairwise ranking score; given the two query-answer pairs, the RM should give the better answer a higher score. There are several ways to achieve this. The DeepSpeed Chat implementation takes the score at the sequence's end-of-text token or first padding token as the aggregate score and compares those; using the average score over the whole answer is an alternative. (A sketch of this loss follows the list.)

  • The --num_padding_at_beginning argument: the RM fine-tuning script exposes this interesting argument because different models can have different padding or tokenizer behavior. Specifically, the tokenizer of the OPT model family always adds a padding token at the beginning of the sequence, which affects which token we pick as the scoring token, so this must be accounted for.

  • RM evaluation: an evaluation script, rw_eval.py, lets users run simple prompt-answer tests.
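A minimal sketch of the pairwise ranking loss itself (illustrative; the DeepSpeed Chat implementation additionally handles which token's score to aggregate, as described above):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """chosen_scores / rejected_scores: shape (batch,), the RM's scalar
    scores for the high- and low-scoring answer to the same query."""
    # Train the RM so that chosen > rejected: -log(sigmoid(r_c - r_r)).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with a batch of three query-answer pairs:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(pairwise_ranking_loss(chosen, rejected))  # small when chosen >> rejected
```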

Here I fine-tune the reward model from opt-350m on a single node with multiple GPUs. You can also train a larger reward model by simply swapping in your preferred candidate model and enabling the efficient-training options described for SFT fine-tuning.

Next, modify the reward-model fine-tuning script training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh.

 


Run:

 

```bash
# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning
sh training_scripts/single_node/run_350m.sh /home/guodong.li/output/dschat-reward 2
```

Follow training through the training.log file, for example with tail -n100 -f training.log. The model weight files:

```bash
> ls -al --block-size=M
total 634M
drwxrwxr-x  2 guodong.li guodong.li   1M May  1 14:26 .
drwxrwxr-x 11 guodong.li guodong.li   1M May  1 13:27 ..
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 config.json
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 merges.txt
-rw-rw-r--  1 guodong.li guodong.li 632M May  1 14:21 pytorch_model.bin
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:22 training.log
-rw-rw-r--  1 guodong.li guodong.li   1M May  1 14:21 vocab.json
```

Next, evaluate the fine-tuned reward model.

Run:

 

```bash
sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
```

Run output:

 

```text
> sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
load_hf_tokenizer model_name_or_path: /home/guodong.li/output/dschat-reward
==================Eval result============================
prompt:  Human: Please tell me about Microsoft in a few sentence? Assistant:
good_ans: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975
bad_ans: I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure.
=============Scores (higher, better)========================
good_ans score: 9.383882522583008
bad_ans score: -3.2731785774230957
==================Eval result============================
prompt:  Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:
good_ans: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another
bad_ans: I don't know, I don't know.
=============Scores (higher, better)========================
good_ans score: 9.291404724121094
bad_ans score: -0.04333972930908203
```

Stage 3: RLHF Training

This is the most complex step of the entire InstructGPT pipeline; DeepSpeed Chat's Hybrid Engine provides enough acceleration here to avoid a heavy training-time (and cost) impact.

With the fine-tuned actor and reward model checkpoints from the previous two steps in hand, you only need to run the following script to start PPO training.

DeepSpeed Chat provides multiple actor training scripts in the training_scripts folder, all of which use an OPT-350m reward model. You can, however, experiment with different reward model sizes.

Here I run RLHF training on a single node with multiple GPUs, with OPT-2.7b as the actor model and OPT-350m as the reward model, again by modifying the opt-13b training script.
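It helps to see what PPO optimizes here: the per-token reward is not the raw RM score alone; a KL penalty against the frozen SFT reference model keeps the actor from drifting too far. A minimal sketch of that reward shaping (illustrative only, not the ppo_trainer.py code; kl_coef is a hypothetical coefficient):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   actor_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) scalar RM score per generated answer.
    actor_logprobs / ref_logprobs: (batch, seq) per-token log-probs under
    the actor and under the frozen SFT reference model."""
    # Per-token KL penalty discourages drifting away from the SFT model.
    rewards = -kl_coef * (actor_logprobs - ref_logprobs)
    # The RM score is credited at the final generated token.
    rewards[:, -1] += rm_score
    return rewards  # fed into PPO's advantage estimation

# Toy usage: batch of 2 answers, 5 generated tokens each.
r = shaped_rewards(torch.tensor([1.5, -0.3]),
                   torch.randn(2, 5), torch.randn(2, 5))
print(r.shape)  # torch.Size([2, 5])
```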

Modify the RLHF training script training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh.

Run:

 

```bash
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat /home/guodong.li/output/dschat-reward 3 3 /home/guodong.li/output/dschat-ppo
```

Follow training through the training.log file, for example with tail -n100 -f training.log:

```text
|cri_loss: 0.0068149566650390625|unsuper_loss: 0.0
average reward score: -4.80078125
-------------------------------------------------------------------------------------
|E2E latency=32.81s |Gather latency=2.58s (7.87%) |Generate time=10.50s (31.99%) |Training time=15.92s (48.52%) |Others=6.39 (19.49%)|CurSamplesPerSec=7.80 |AvgSamplesPerSec=7.49
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 110|ppo_ep: 1|act_loss: 0.0003905296325683594|cri_loss: 0.00641632080078125|unsuper_loss: 0.0
...
-------------------------------------------------------------------------------------
|E2E latency=33.83s |Gather latency=3.25s (9.60%) |Generate time=9.96s (29.45%) |Training time=17.73s (52.40%) |Others=6.14 (18.15%)|CurSamplesPerSec=7.57 |AvgSamplesPerSec=7.49
epoch: 0|step: 119|ppo_ep: 1|act_loss: 0.00606536865234375|cri_loss: 0.0023479461669921875|unsuper_loss: 0.0
average reward score: -4.91796875
-------------------------------------------------------------------------------------
saving model ...
...
saving model ...
[2023-05-01 16:54:46,717] [INFO] [launch.py:460:main] Process 37162 exits successfully.
...
[2023-05-01 16:54:49,720] [INFO] [launch.py:460:main] Process 37158 exits successfully.
```

Model weight output files:

 

```bash
tree
.
├── actor
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   └── vocab.json
├── critic
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   └── vocab.json
└── training.log
########################################
> ls -al --block-size=M actor/ critic/
actor/:
total 5059M
drwxrwxr-x 2 guodong.li guodong.li    1M May  1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li    1M May  1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 5058M May  1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li    1M May  1 16:54 vocab.json

critic/:
total 634M
drwxrwxr-x 2 guodong.li guodong.li   1M May  1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li   1M May  1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 632M May  1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li   1M May  1 16:54 vocab.json
```

One-Click RLHF Training

DeepSpeed Chat also provides a single script that completes all three steps of RLHF training and produces your ChatGPT-like model.

Run:

 

```bash
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
```

Arguments:

  • --deployment-type: deployment type; supports single GPU (single_gpu), single node with multiple GPUs (single_node), and multiple nodes (multi_node)
  • --actor-model: the actor model
  • --reward-model: the reward model
  • --output-dir: output path for the model weights
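Conceptually, train.py is an orchestrator that launches the three stages in order. A toy sketch of the idea (a rough approximation with assumed script choices, not the actual train.py logic, which builds each stage's command from the CLI arguments):

```python
import subprocess

# Run each stage's launcher in order; abort the pipeline on failure.
stages = [
    ("training/step1_supervised_finetuning",
     ["bash", "training_scripts/single_node/run_1.3b.sh"]),
    ("training/step2_reward_model_finetuning",
     ["bash", "training_scripts/single_node/run_350m.sh"]),
    ("training/step3_rlhf_finetuning",
     ["bash", "training_scripts/single_node/run_1.3b.sh"]),
]
for workdir, cmd in stages:
    subprocess.run(cmd, cwd=workdir, check=True)
```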


Model weight output files:

 

```bash
tree ds-pipeline/
ds-pipeline/
├── actor-models
│   └── 13b
│       ├── config.json
│       ├── merges.txt
│       ├── pytorch_model.bin
│       ├── training.log
│       └── vocab.json
├── reward-models
│   └── 350m
│       ├── config.json
│       ├── merges.txt
│       ├── pytorch_model.bin
│       ├── training.log
│       └── vocab.json
└── step3-models
    └── 13b
        ├── actor
        │   ├── config.json
        │   ├── merges.txt
        │   ├── pytorch_model.bin
        │   └── vocab.json
        ├── critic
        │   ├── config.json
        │   ├── merges.txt
        │   ├── pytorch_model.bin
        │   └── vocab.json
        └── training.log
```

Model Serving (Inference)

To quickly test the final model trained by DeepSpeed-Chat, a simple serving script is provided.

 

```bash
# serve the final model
python chat.py --path ${PATH-to-your-actor-model}
```
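Under the hood this is essentially a REPL around model.generate. A minimal stand-in (illustrative, not the actual chat.py) that loads the trained actor and chats in the same Human/Assistant format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/home/guodong.li/output/dschat-ppo/actor"  # this article's actor path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

history = ""
while True:
    user = input("Enter input (type 'quit' to exit): ")
    if user == "quit":
        break
    history += f" Human: {user} Assistant:"
    ids = tokenizer(history, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=128)
    # Decode only the newly generated tokens and append them to the history.
    reply = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    history += reply
    print("Assistant:", reply)
```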

Run output:

 

```text
> python chat.py --path /home/guodong.li/output/dschat-ppo/actor
Enter input (type 'quit' to exit, 'clear' to clean memory): Do you know Microsoft?
------------------------------ Round 1 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you explian it to a 6-year old child?
------------------------------ Round 2 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
 Human: Can you explian it to a 6-year old child?
 Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): who are you?
------------------------------ Round 3 ------------------------------
 Human: Do you know Microsoft?
 Assistant: Microsoft is a software company.</s>
 Human: Can you explian it to a 6-year old child?
 Assistant: Microsoft is a software company.</s>
 Human: who are you?
 Assistant: Microsoft is a software company.</s></s>
Enter input (type 'quit' to exit, 'clear' to clean memory):
```

To build personal assistants, chatbots, or other LLM applications on top of a model trained with DeepSpeed Chat, see LangChain.

Conclusion

This article walked through RLHF training with DeepSpeed Chat on a single multi-GPU node using OPT models. I hope it proves useful.

References

  • DeepSpeed Chat: one-click RLHF training that makes your ChatGPT-like 100-billion-parameter models 15x faster and cheaper
  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
  • Step 1: Supervised Fine-Tuning (SFT)
  • Step 2: Reward Model Fine-Tuning
  • Step 3: Reinforcement Learning from Human Feedback (RLHF)
  • DeepSpeed Chat training details
