Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
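January 24, 2024, by Douglas Jia
In this blog, we show you how to pre-train a GPT-3 model using the Megatron-DeepSpeed framework on multiple AMD GPUs. We also show you how to use your pre-trained model to perform inference on a text-generation task.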
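What is Megatron-DeepSpeed?
Microsoft developed Megatron-DeepSpeed by integrating its DeepSpeed library into NVIDIA's Megatron-LM framework. DeepSpeed is Microsoft's optimization library, designed to simplify and enhance distributed training and inference. It introduces a suite of optimizations that make these processes more efficient.
Megatron-LM is NVIDIA's framework for training large transformer models. It can handle massive models and complex deep-learning tasks, which makes it an ideal starting point for the advancements DeepSpeed brings.
What sets Megatron-DeepSpeed apart is its comprehensive support for a wide range of features, from mixture-of-experts training to curriculum learning. This makes it a versatile tool for a variety of deep-learning challenges. With Megatron-DeepSpeed, you can train larger models with greater efficiency and at larger scale.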
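3D parallelism
The highlight of Megatron-DeepSpeed is its implementation of 3D parallelism. This approach combines Zero Redundancy Optimizer (ZeRO) sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. The combination lets you train very large models efficiently and opens a new frontier in model scalability.
Like TensorParallel, ZeRO performs tensor sharding. What makes ZeRO distinctive is that it reconstructs the whole tensor just in time for computation, without requiring any modification to the model. This approach also supports various offloading techniques to cope with limited GPU memory.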
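To make the just-in-time reconstruction concrete, here is a toy, single-process sketch in plain Python. It is not DeepSpeed's implementation: the helper names (`shard`, `all_gather`, `forward_with_sharded_weights`) are hypothetical, ranks are simulated as list entries, and the "forward pass" is just a dot product.

```python
def shard(values, world_size):
    """Split a flat parameter list into contiguous shards, one per simulated rank."""
    size = -(-len(values) // world_size)  # ceiling division
    return [values[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards):
    """Rebuild the full parameter from every rank's shard (stand-in for a collective)."""
    return [value for rank_shard in shards for value in rank_shard]


def forward_with_sharded_weights(shards, inputs):
    """Materialize the full weight only for the computation, then drop it again."""
    full_weight = all_gather(shards)  # just-in-time reconstruction
    output = sum(w * x for w, x in zip(full_weight, inputs))
    del full_weight  # between steps, each rank keeps only its own shard
    return output


weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
shards = shard(weights, world_size=4)
print(shards)  # each simulated rank permanently stores only 2 of the 8 values
print(forward_with_sharded_weights(shards, [1.0] * 8))
```

The calling code never needs to know that the weights are sharded, which is the sense in which the model itself does not have to be modified.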
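Megatron-DeepSpeed's 3D parallelism has three key components:
- Data parallelism (DataParallel): Replicates the setup and processes slices of data in parallel, synchronizing at the end of each step.
- Tensor parallelism (TensorParallel): Distributes slices of a tensor across multiple GPUs for independent parallel processing, which amounts to splitting the model horizontally.
- Pipeline parallelism (PipelineParallel): Splits the model vertically at the layer level and distributes it across multiple GPUs so that different stages can be processed in parallel.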
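As a rough illustration of how the three dimensions share a fixed pool of GPUs, the sketch below derives the data-parallel degree from the total GPU count and maps each global rank to a (data, pipeline, tensor) coordinate. The function names are hypothetical, and the rank ordering (tensor index varying fastest) is an assumption made for illustration; the framework's actual process-group layout may differ.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas once each replica's GPUs are accounted for."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // gpus_per_replica


def rank_to_coords(rank, tensor_parallel, pipeline_parallel):
    """Map a global rank to (data, pipeline, tensor) indices (assumed ordering)."""
    tensor_idx = rank % tensor_parallel
    pipeline_idx = (rank // tensor_parallel) % pipeline_parallel
    data_idx = rank // (tensor_parallel * pipeline_parallel)
    return data_idx, pipeline_idx, tensor_idx


# Example: 8 GPUs with tensor parallelism 2 and pipeline parallelism 2.
print("data-parallel replicas:", data_parallel_size(8, 2, 2))
for rank in range(8):
    print(rank, rank_to_coords(rank, 2, 2))
```

With 8 GPUs, a tensor-parallel degree of 2, and a pipeline-parallel degree of 2, each model replica occupies 4 GPUs, which leaves 2 data-parallel replicas.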
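Why use AMD GPUs?
AMD GPUs offer robust open-source support, with tools such as ROCm and HIP that make them easy to adapt to AI workflows. Our strong performance-to-cost ratio caters to anyone looking for cost-effective solutions for AI and deep-learning tasks. As AMD's presence in the market grows, more machine learning libraries and frameworks are adding support for AMD GPUs.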
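Hardware and software requirements
To obtain the computing power this task requires, we used the AMD Accelerator Cloud (AAC), a platform that offers on-demand cloud computing resources and APIs. On AAC, we used a PyTorch Docker container (version: rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1; we also tested version: rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2) with 8 GPUs.
Our methods are hardware-agnostic, so access to AAC is not required to run our code examples. As long as you have access to accelerator devices, such as GPUs or tensor processing units (TPUs), you should be able to run the code examples with minimal modification. If you're using AMD GPUs, make sure that ROCm and a compatible version of PyTorch are installed correctly. Refer to the following two tutorials for installation:
- ROCm installation
- PyTorch installation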
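Before moving on, it can help to confirm that PyTorch actually sees your GPUs. The following is a minimal sketch; the helper name `describe_accelerator` is hypothetical. On ROCm builds of PyTorch, AMD GPUs are exposed through the `torch.cuda` API, and `torch.version.hip` is set to the HIP version (it is `None` on non-ROCm builds).

```python
import importlib.util


def describe_accelerator():
    """Summarize the PyTorch build and the GPUs it can see, if PyTorch is installed."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


print(describe_accelerator())
```

On the 8-GPU container described above, you would expect `gpu_available` to be true and `gpu_count` to be 8.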
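Code example on pre-training a GPT-3 model
First, install DeepSpeed (and other required packages) and clone the Megatron-DeepSpeed GitHub repository to your local machine (or server). Then download and pre-process the dataset used for pre-training. Code blocks that begin with %%sh contain Linux command-line code. We use /home/aac as our home directory (or /var/lib/jenkins if you pull the Docker image directly from Docker Hub); replace it with your own home directory when you run the code.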
%%sh
python -m pip install --upgrade pip
# Install DeepSpeed and other packages
home_dir=/var/lib/jenkins
cd $home_dir
pip install -U pip \
&& pip3 install deepspeed transformers pybind11 nltk ipython matplotlib
# Clone the Megatron-DeepSpeed repository
cd $home_dir
git clone https://github.com/microsoft/Megatron-DeepSpeed.git
cd Megatron-DeepSpeed
# Install libaio-dev, rustc, and cargo
apt-get update && apt-get -y install libaio-dev rustc cargo
# Download the dataset
cd dataset
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
bash download_vocab.sh
# Pre-process the OSCAR dataset
export BASE_SRC_PATH=$home_dir/Megatron-DeepSpeed
export BASE_DATA_PATH=${BASE_SRC_PATH}/dataset
python3 ${BASE_SRC_PATH}/tools/preprocess_data.py --input ${BASE_DATA_PATH}/oscar-1GB.jsonl --output-prefix ${BASE_DATA_PATH}/my-gpt2 --vocab-file ${BASE_DATA_PATH}/gpt2-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --merge-file ${BASE_DATA_PATH}/gpt2-merges.txt --append-eod --workers 8
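# Install FlashAttention (optional). FlashAttention provides a fast and memory-efficient implementation of the attention mechanism. If you don't want to use FlashAttention, remove the `--use-flash-attn` flag from the script.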
cd $home_dir
git clone --recursive https://github.com/ROCmSoftwarePlatform/flash-attention.git
cd flash-attention
py_version=$(python -V | grep -oP '(?<=[.])\w+(?=[.])')
patch /opt/conda/envs/py_3.${py_version}/lib/python3.${py_version}/site-packages/torch/utils/hipify/hipify_python.py hipify_patch.patch
python setup.py install
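After the setup commands finish, a quick check that the dataset files are in place can save a failed training launch. The sketch below is illustrative only: `missing_dataset_files` is a hypothetical helper, and the `<output-prefix>_<json key>_document.bin/.idx` naming of the pre-processed files is an assumption about what `preprocess_data.py` writes when it reads the default `text` field; adjust the names if your output differs.

```python
from pathlib import Path


def missing_dataset_files(data_dir, prefix="my-gpt2", json_key="text"):
    """Return the names of expected dataset files that are not present in data_dir."""
    base = Path(data_dir)
    expected = [
        base / "gpt2-vocab.json",
        base / "gpt2-merges.txt",
        # Assumed naming: <output-prefix>_<json key>_document.{bin,idx}
        base / f"{prefix}_{json_key}_document.bin",
        base / f"{prefix}_{json_key}_document.idx",
    ]
    return [path.name for path in expected if not path.is_file()]


# Replace the path with your own dataset directory.
print(missing_dataset_files("/var/lib/jenkins/Megatron-DeepSpeed/dataset"))
```

An empty list means all four files were found.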
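Next, train a small GPT-3 model using 8 GPUs on a single node. The main training script is ds_pretrain_gpt_125M_flashattn.sh. You must modify several lines of the script to match your intended configuration (for example, how often model checkpoints are saved and how the 3D parallelism is configured). Settings you may need to modify include: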
- num_gpus
- num_gpus_pernode
- num_node
- log_interval
- eval_iters
- eval_interval
- num_save
- save_interval
- vocab_path
- merge_path
- data_path
- File paths in data_options
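Because ROCm doesn't currently support gradient accumulation fusion, you must add --no-gradient-accumulation-fusion to megatron_options. You can look at the actual training script we used to see what needs to be modified and how.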
%%sh
cd /home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase
nohup bash ds_pretrain_gpt_125M_flashattn.sh &
The pre-training output is saved in the output folder. To make sure everything is working, you can verify that the files exist.
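One way to run that check from Python is to list which checkpoint iterations have been written so far. This is a small sketch with a hypothetical helper name; it relies only on the `global_step<N>` folder naming that the conversion commands in this post use, and the path in the example is a placeholder.

```python
import re
from pathlib import Path


def list_checkpoint_steps(checkpoint_dir):
    """Return the sorted iteration numbers of all global_step<N> subfolders."""
    steps = []
    for child in Path(checkpoint_dir).iterdir():
        match = re.fullmatch(r"global_step(\d+)", child.name)
        if match and child.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


# Placeholder path: point this at your run's checkpoint folder.
run_dir = Path("output/checkpoint/your_run_name")
if run_dir.is_dir():
    print(list_checkpoint_steps(run_dir))
else:
    print(f"{run_dir} does not exist yet")
```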
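Convert the DeepSpeed checkpoint to a Hugging Face checkpoint
The checkpoints saved by the Megatron-DeepSpeed package are in DeepSpeed format. You can convert them to Megatron or Hugging Face format using the functions in the tools/convert_checkpoint folder. For our inference example, we convert the checkpoints to Hugging Face format. You may need to modify the tools/convert_checkpoint/deepspeed_to_megatron.py file to run the program (change from .deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint to from deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint). We convert the checkpoints from iterations 2,000 and 8,000 so that we can compare their performance at inference. You must modify the checkpoint paths in the Python commands to match your local paths.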
%%sh
# Install the packages required for this step
pip install matplotlib megatron megatron.core transformers
# Convert the checkpoint at 8,000 iterations to HF transformers format
python /home/aac/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py \
--input_folder /home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/global_step8000 \
--output_folder /home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/HF/global_step8000
# Convert the checkpoint at 2,000 iterations as well so that we can compare model performance
python /home/aac/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py \
--input_folder /home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/global_step2000 \
--output_folder /home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/HF/global_step2000
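The long run name in these paths appears to encode the training configuration. The sketch below parses a few of its fields and derives the number of gradient accumulation steps. The field meanings are assumptions read from the abbreviations (`gbs` = global batch size, `mbs` = micro batch size, `g` = number of GPUs, `z` = ZeRO stage, `mp` = tensor/model-parallel size, `pp` = pipeline-parallel size), and both function names are hypothetical.

```python
import re

# Run name copied from the checkpoint paths in the conversion commands.
RUN_NAME = (
    "gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_"
    "gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase"
)


def parse_run_name(name):
    """Extract the integer fields that follow known abbreviations in the run name."""
    fields = {}
    for key in ("gbs", "mbs", "g", "z", "mp", "pp", "seed"):
        match = re.search(rf"_{key}(\d+)(?=_|$)", name)
        if match:
            fields[key] = int(match.group(1))
    return fields


def gradient_accumulation_steps(fields):
    """Micro-batches per replica per optimizer step, assuming gbs = mbs * dp * steps."""
    data_parallel = fields["g"] // (fields["mp"] * fields["pp"])
    return fields["gbs"] // (fields["mbs"] * data_parallel)


fields = parse_run_name(RUN_NAME)
print(fields)
print("gradient accumulation steps:", gradient_accumulation_steps(fields))
```

Under these assumptions, 8 GPUs with `mp2` and `pp2` leave 2 data-parallel replicas, so a global batch of 256 with micro batches of 2 works out to 64 accumulation steps per optimizer update.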
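Load the pre-trained model and perform a text-generation task
Now you can evaluate the performance of the pre-trained models. Although pre-trained models typically need fine-tuning for downstream tasks, you can still get a sense of their capabilities by running a text-generation task. We load the checkpoints from iterations 2,000 and 8,000 into `model0` and `model1`, respectively, and evaluate their text-generation ability with the prompt "I like to play golf. Today is a sunny day and I plan to". Each model generates three samples from this prompt. You need to change `path0` and `path1` to the paths of your own checkpoints.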
from transformers import GPT2LMHeadModel
from transformers import GPT2Tokenizer
from transformers import set_seed
import torch
path0 = "/home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/HF/global_step2000/"
path1 = "/home/aac/Megatron-DeepSpeed/examples_deepspeed/rebase/output/checkpoint/gpt_0.125B_tok300B_lr6.0e-4_min1.0e-6_w3000M_d300B_cosine_gbs256_mbs2_g8_z1_mp2_pp2_seed1234_rebase/HF/global_step8000/"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer(vocab_file='/home/aac/Megatron-DeepSpeed/dataset/gpt2-vocab.json', merges_file='/home/aac/Megatron-DeepSpeed/dataset/gpt2-merges.txt')
model0 = GPT2LMHeadModel.from_pretrained(path0, pad_token_id=tokenizer.eos_token_id).to(torch_device)
model1 = GPT2LMHeadModel.from_pretrained(path1, pad_token_id=tokenizer.eos_token_id).to(torch_device)
# For more information on how to fine-tune the text generation process, see: https://huggingface.co/blog/how-to-generate
# Encode the context that the generation is conditioned on
model_inputs = tokenizer('I like to play golf. Today is a sunny day and I plan to', return_tensors='pt').to(torch_device)
# Set the seed for reproducing results (you can change the seed to get different results)
set_seed(1)
# Set top_k = 50, top_p = 0.95, and num_return_sequences = 3
sample_outputs = model0.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
print("Output from the 2000-iteration checkpoint:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
# Set top_k = 50, top_p = 0.95, and num_return_sequences = 3
sample_outputs = model1.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
print("\nOutput from the 8000-iteration checkpoint:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
Output from the 2000-iteration checkpoint:
----------------------------------------------------------------------------------------------------
0: I like to play golf. Today is a sunny day and I plan to work and get to work with my team. I think that I can make money but I make the effort to get to see this. I know how it works, but I do think that will
1: I like to play golf. Today is a sunny day and I plan to go to the side of my life. It’s really simple! We have been there for a couple of days to try our training program. I have heard the video out there, I think
2: I like to play golf. Today is a sunny day and I plan to get along that summer. A great weekend and a good one can be prepared. I'm a great place to try. It's fun to go and give you the chance to get along with me
Output from the 8000-iteration checkpoint:
----------------------------------------------------------------------------------------------------
0: I like to play golf. Today is a sunny day and I plan to play some golf in the evening. I have not played my other tournaments until this morning.
1: I like to play golf. Today is a sunny day and I plan to play the whole week of golf. I will be playing in the backyards to play golf. If you are still interested in playing the “American Association” Tournament, please don't hesitate
2: I like to play golf. Today is a sunny day and I plan to get there on Monday morning. You’ll notice me playing in the backyard. My dad bought me the equipment, so I could throw it at home. When we went out to dinner we
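The `top_k` and `top_p` arguments in the generation calls restrict which tokens can be sampled at each step. The toy function below sketches the idea on a five-token distribution. It is a simplified illustration, not the Hugging Face implementation: the name `top_k_top_p_filter` is hypothetical, and real implementations operate on logits over the full vocabulary.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Return a renormalized {token_id: prob} dict after top-k, then top-p, filtering."""
    # Keep the top_k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    # Keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id] / total
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


probs = [0.5, 0.3, 0.1, 0.05, 0.05]
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(filtered)  # only token ids 0, 1, and 2 survive the filtering

# Sample the next token id from the filtered distribution.
random.seed(1)
print(random.choices(list(filtered), weights=list(filtered.values()), k=1)[0])
```

Lower values of `top_k` or `top_p` make the output more conservative, while higher values allow more varied continuations.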
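From the generated samples, we can see that model1 produces text that is more logical and more relevant to the given context. Note that we reached this capability in less than two days of training on 8 MI210 GPUs (the actual time depends on the GPU model you use). If you don't want to go through the extensive pre-training process, you can get both model checkpoints directly from Hugging Face, as shown below: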
model3 = GPT2LMHeadModel.from_pretrained('jiagaoxiang/gpt3-125M-2000iter', pad_token_id=tokenizer.eos_token_id).to(torch_device)
model4 = GPT2LMHeadModel.from_pretrained('jiagaoxiang/gpt3-125M-8000iter', pad_token_id=tokenizer.eos_token_id).to(torch_device)
model_inputs = tokenizer('I like to play golf. Today is a sunny day and I plan to', return_tensors='pt').to(torch_device)
# Set the seed for reproducing results (you can change the seed to get different results).
set_seed(1)
# Set top_k = 50, top_p = 0.95, and num_return_sequences = 3
sample_outputs = model3.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
print("Output from the 2000-iteration checkpoint:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
# Set top_k = 50, top_p = 0.95, and num_return_sequences = 3
sample_outputs = model4.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
print("\nOutput from the 8000-iteration checkpoint:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
Output from the 2000-iteration checkpoint:
----------------------------------------------------------------------------------------------------
0: I like to play golf. Today is a sunny day and I plan to work and get to work with my team. I think that I can make money but I make the effort to get to see this. I know how it works, but I do think that will
1: I like to play golf. Today is a sunny day and I plan to go to the side of my life. It’s really simple! We have been there for a couple of days to try our training program. I have heard the video out there, I think
2: I like to play golf. Today is a sunny day and I plan to get along that summer. A great weekend and a good one can be prepared. I'm a great place to try. It's fun to go and give you the chance to get along with me
Output from the 8000-iteration checkpoint:
----------------------------------------------------------------------------------------------------
0: I like to play golf. Today is a sunny day and I plan to play some golf in the evening. I have not played my other tournaments until this morning.
1: I like to play golf. Today is a sunny day and I plan to play the whole week of golf. I will be playing in the backyards to play golf. If you are still interested in playing the “American Association” Tournament, please don't hesitate
2: I like to play golf. Today is a sunny day and I plan to get there on Monday morning. You’ll notice me playing in the backyard. My dad bought me the equipment, so I could throw it at home. When we went out to dinner we