DeepSpeed ZeRO Optimization: A Beginner-Friendly Guide to Concepts and Practical Use
Introduction
In large-scale deep learning, GPU memory is a common bottleneck during model training. DeepSpeed's ZeRO (Zero Redundancy Optimizer) is a powerful tool for saving memory and improving efficiency when training large models. This article explains ZeRO's principles, the characteristics and use cases of its three stages, and shows how to apply ZeRO Stage 2 with a real training script. The examples are built around the https://github.com/allenai/open-instruct framework.
Core Principles of ZeRO Optimization
In deep learning, model training consumes large amounts of GPU memory to store:
- Model parameters
- Gradients
- Optimizer states
In distributed training, these components are normally replicated on every GPU, which wastes memory. ZeRO's goal is to eliminate this redundant storage by distributing these components efficiently across GPUs, making it possible to train large models on limited hardware.
ZeRO works through the following mechanisms:
- Parameter sharding: model parameters, gradients, and optimizer states are partitioned across devices, and each GPU stores only the slice it is responsible for (a toy sketch of this idea follows below).
- On-demand access: each GPU accesses other shards only when it needs them, avoiding unnecessary communication and storage overhead.
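To make sharding concrete, here is a minimal PyTorch sketch, illustrative only and not DeepSpeed's actual implementation, that splits a flat fp32 optimizer-state tensor across a hypothetical group of 4 GPUs so that each rank keeps just its own 1/N slice:
import torch

# Illustrative only: mimic how ZeRO splits a flat optimizer-state tensor
# across data-parallel ranks so that each GPU stores a single shard.
world_size = 4                      # hypothetical number of GPUs
flat_state = torch.zeros(10_000)    # stand-in for fp32 optimizer state

# torch.chunk hands each rank a contiguous 1/world_size slice.
shards = list(torch.chunk(flat_state, world_size))

for rank, shard in enumerate(shards):
    # In real ZeRO, rank `rank` would keep only `shard` in GPU memory and
    # gather the other shards over the network only when they are needed.
    print(f"rank {rank} stores {shard.numel()} of {flat_state.numel()} elements")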
The Three ZeRO Stages and How They Differ
DeepSpeed offers three ZeRO optimization stages, each striking a different balance between memory savings and communication overhead:
Stage 1: Partitioning Optimizer States
- What it does: only the optimizer states are partitioned; each GPU still keeps a full copy of the gradients and model parameters.
- Advantages: easy to adopt, with little extra communication.
- Best for: medium-scale training; the memory savings are limited.
Stage 2: Partitioning Optimizer States and Gradients
- What it does: building on Stage 1, gradients are also partitioned, so each GPU stores only the portion it is responsible for.
- Advantages: substantially lower memory requirements with moderate communication overhead.
- Best for: most large-scale training jobs; a good balance of efficiency and resource savings.
Stage 3: Partitioning Optimizer States, Gradients, and Model Parameters
- What it does: optimizer states, gradients, and model parameters are all partitioned, so training requires frequent communication to reassemble data when it is needed.
- Advantages: the smallest memory footprint, suitable for extremely large models.
- Best for: very large models such as GPT-3; the trade-off is high communication overhead and demanding network-bandwidth requirements. A rough per-GPU memory comparison of the three stages follows below.
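The memory impact of each stage can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter each for the bf16/fp16 weights and gradients, plus roughly 12 bytes per parameter of fp32 optimizer state. The sketch below is a rough estimate only, ignores activations, and assumes a model of about 2 billion parameters on 4 GPUs, matching the setup used later in this article:
# Rough per-GPU memory estimate (ZeRO-paper accounting, activations ignored).
psi = 2e9          # assumed parameter count (~2B, roughly the gemma-2-2b scale)
n_gpus = 4         # data-parallel degree
GB = 1e9

params_plus_grads = 2 * psi + 2 * psi    # bf16 parameters + bf16 gradients
opt_states = 12 * psi                    # fp32 master weights + Adam momentum/variance

baseline = (params_plus_grads + opt_states) / GB               # everything replicated
stage1 = (params_plus_grads + opt_states / n_gpus) / GB        # optimizer states partitioned
stage2 = (2 * psi + (2 * psi + opt_states) / n_gpus) / GB      # gradients partitioned too
stage3 = (params_plus_grads + opt_states) / n_gpus / GB        # parameters partitioned as well

print(f"baseline {baseline:.0f} GB | stage 1 {stage1:.0f} GB | "
      f"stage 2 {stage2:.0f} GB | stage 3 {stage3:.0f} GB")
# -> baseline 32 GB | stage 1 14 GB | stage 2 11 GB | stage 3 8 GB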
Using ZeRO Stage 2 in Practice
The following complete example shows how to train with the ZeRO Stage 2 optimizer.
Configuration File
Create a DeepSpeed configuration file named stage2_no_offloading_accelerate.conf:
{
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"sub_group_size": 1e9
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 1e5,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- stage: 2 enables ZeRO Stage 2.
- overlap_comm: true overlaps communication with computation to speed up training.
- contiguous_gradients: true optimizes the gradient memory layout and further reduces memory fragmentation.
In the open-instruct repository this file lives at configs/ds_configs/stage2_no_offloading_accelerate.conf, the path referenced by the launch command below.
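In this article the config is consumed through accelerate, which resolves the "auto" entries from the launch arguments. Purely as an illustration, the hypothetical sketch below shows the same Stage 2 config driving the raw DeepSpeed engine on a toy model; the "auto" placeholders must first be replaced with concrete values, and the script is assumed to be started with the deepspeed launcher on bf16-capable GPUs:
import json
import os

import torch
import deepspeed

# Minimal, hypothetical sketch: drive the same Stage 2 config with the raw
# DeepSpeed engine. Launch with `deepspeed` so RANK/WORLD_SIZE and the
# distributed backend are set up, and use a bf16-capable GPU.
with open("configs/ds_configs/stage2_no_offloading_accelerate.conf") as f:
    ds_config = json.load(f)

# Raw DeepSpeed does not resolve "auto" (only the Hugging Face integrations do),
# so replace the placeholders with concrete values.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
ds_config["bf16"]["enabled"] = True
ds_config["zero_optimization"]["reduce_bucket_size"] = int(5e8)
ds_config["gradient_accumulation_steps"] = 2
ds_config["gradient_clipping"] = 1.0
ds_config["train_micro_batch_size_per_gpu"] = 1
ds_config["train_batch_size"] = 1 * 2 * world_size

model = torch.nn.Linear(1024, 1024)                         # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

# deepspeed.initialize wraps model and optimizer in a ZeRO Stage 2 engine.
engine, engine_optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
None of this wiring is needed for the script below, where accelerate performs it automatically.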
Bash Script
Below is the complete training script with the ZeRO Stage 2 settings applied; it is taken from https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md:
# Set model and training parameters
MODEL_NAME=google/gemma-2-2b
MACHINE_RANK=0
MAIN_PROCESS_IP=127.0.0.1
MAIN_PROCESS_PORT=29400
NUM_MACHINES=1
NUM_PROCESSES=4
PER_DEVICE_TRAIN_BATCH_SIZE=1
GRADIENT_ACCUMULATION_STEPS=2
# Launch command
accelerate launch \
--mixed_precision bf16 \
--num_machines $NUM_MACHINES \
--num_processes $NUM_PROCESSES \
--machine_rank $MACHINE_RANK \
--main_process_ip $MAIN_PROCESS_IP \
--main_process_port $MAIN_PROCESS_PORT \
--use_deepspeed \
--deepspeed_config_file configs/ds_configs/stage2_no_offloading_accelerate.conf \
--deepspeed_multinode_launcher standard open_instruct/finetune.py \
--model_name_or_path $MODEL_NAME \
--tokenizer_name $MODEL_NAME \
--use_slow_tokenizer \
--use_flash_attn \
--max_seq_length 2048 \
--preprocessing_num_workers 4 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--learning_rate 5e-06 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.0 \
--num_train_epochs 1 \
--output_dir output/sft_2b \
--with_tracking \
--report_to wandb \
--logging_steps 1 \
--reduce_loss sum \
--model_revision main \
--dataset_mixer_list allenai/tulu-3-sft-mixture 1.0 \
--checkpointing_steps epoch \
--dataset_mix_dir output/sft_2b \
--exp_name tulu-2b-sft \
--seed 123
Key Arguments Explained
- --use_deepspeed: enables DeepSpeed.
- --deepspeed_config_file: path to the DeepSpeed configuration file.
- --mixed_precision bf16: enables bfloat16 mixed-precision training to save memory.
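It is also worth spelling out how the "auto" batch-size fields in the config get resolved. Assuming the standard formula of per-device batch size times number of processes times gradient accumulation steps, the launch arguments above imply the following effective global batch size:
# Effective global batch size implied by the launch arguments above.
per_device_train_batch_size = 1
num_processes = 4                 # GPUs participating in data parallelism
gradient_accumulation_steps = 2

train_batch_size = (per_device_train_batch_size
                    * num_processes
                    * gradient_accumulation_steps)
print(train_batch_size)           # -> 8, the value "train_batch_size": "auto" resolves to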
Advantages of ZeRO Stage 2
- Memory savings: compared with Stage 1, it further removes the per-GPU gradient memory.
- Broad applicability: it supports large-scale model training without a significant increase in communication cost.
- Ease of use: the configuration is simple, making it a good fit for most users.
Summary
DeepSpeed's ZeRO optimization gives deep learning practitioners a new set of tools for saving memory and training efficiently. From Stage 1 to Stage 3, each stage suits a different scenario. With the material above, you should now understand ZeRO's core principles, what distinguishes its three stages, and how to use ZeRO Stage 2 in a real training run.
Postscript
Written in Shanghai on November 27, 2024, at 21:16.