Training YOLOv5: single-GPU and multi-GPU setups for faster training

news2026/2/12 20:26:51

YOLOv5 is one of the more accurate and faster object detection frameworks currently available, and it is widely used in both industry and academia.

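Whether training or running inference, we want the job to finish quickly, and efficiency matters most when the dataset is large. This post briefly covers the training modes YOLOv5 supports, both to clarify how deep learning models are trained and to help you pick the most efficient mode for your own hardware.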

The official YOLOv5 multi-GPU training tutorial: https://github.com/ultralytics/yolov5/issues/475

If your hardware allows it, multi-GPU DDP training is the first choice.

1、Single-GPU training

python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0

2、Multi-GPU DataParallel Mode (⚠️ not recommended)

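Multi-GPU DP training is not recommended.

The official tutorial advises against it as well: "This method is slow and barely speeds up training compared to using just 1 GPU." In DP mode the data is spread across several GPUs, but the results are computed on the primary GPU, so memory usage ends up unbalanced between the primary GPU and the others.

python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1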

3、Multi-GPU DistributedDataParallel Mode (✅ recommended)

Multi-GPU DDP is the strongly recommended method.
Launch training through python -m torch.distributed.run --nproc_per_node, as in the following command:

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

--nproc_per_node specifies how many GPUs you would like to use. In the example above, it is 2.
--batch is the total batch size. It is divided evenly across the GPUs. In the example above, it is 64/2=32 per GPU.
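The batch split described above can be sketched in a few lines of Python. Note that per_gpu_batch_size is a hypothetical helper written for illustration, not a YOLOv5 function, and it assumes that a total batch size that does not divide evenly should be rejected rather than rounded.

```python
def per_gpu_batch_size(total_batch: int, num_gpus: int) -> int:
    """Return the batch size each GPU processes (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        # Assumption: an uneven split is treated as an error, not rounded.
        raise ValueError(
            f"--batch {total_batch} is not a multiple of {num_gpus} GPUs"
        )
    return total_batch // num_gpus


if __name__ == "__main__":
    # Same numbers as the command above: --batch 64 with --nproc_per_node 2.
    print(per_gpu_batch_size(64, 2))
```

Checking the split before launching is cheap and avoids starting a multi-process job that fails on its first assertion.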

Example (training from scratch on GPUs 2 and 3):

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3

SyncBatchNorm with DDP: SyncBatchNorm can increase accuracy for multi-GPU training, but it slows training down by a significant factor. It is only available for multi-GPU DistributedDataParallel training, and it is best used when the batch size on each GPU is small (<= 8).

To use SyncBatchNorm, simply pass --sync-bn to the command, as below:

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn
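As a rough illustration of the rule of thumb above, the following sketch decides whether --sync-bn is worth considering for a given run. The helper suggest_sync_bn and its small_batch parameter are hypothetical names introduced here, and the threshold of 8 is simply the figure quoted in the tutorial, not a hard limit.

```python
def suggest_sync_bn(total_batch: int, num_gpus: int, small_batch: int = 8) -> bool:
    """Rule-of-thumb check for --sync-bn (hypothetical helper)."""
    if num_gpus < 2:
        # SyncBatchNorm only applies to multi-GPU DDP training.
        return False
    # Suggest it only when each GPU sees a small batch.
    return total_batch // num_gpus <= small_batch


if __name__ == "__main__":
    print(suggest_sync_bn(16, 2))  # per-GPU batch of 8
    print(suggest_sync_bn(64, 2))  # per-GPU batch of 32
```

Because SyncBatchNorm slows training noticeably, a check like this is only a starting point; measuring accuracy and speed on your own data is the real test.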

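Multi-machine training: the machines must stay in communication with each other throughout training, which costs some efficiency. The official command for multi-machine training is: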

# On machine R
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

where G is the number of GPUs per machine, N is the number of machines, and R is the machine number from 0…(N-1). The machine with R = 0 acts as the master machine, and --master_addr must point to it.
For example, with two machines that have two GPUs each, G = 2 and N = 2, and the command above is run with R = 0 on the master machine and R = 1 on the other.
Training will not start until all N machines are connected. Output will only be shown on the master machine!
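To make the roles of G, N, and R concrete, here is a small sketch that assembles the launch command for each machine. build_launch_command is a hypothetical helper, not part of YOLOv5 or PyTorch; it only strings together the flags already shown above, and the address and port are the placeholder values from the example.

```python
import shlex


def build_launch_command(gpus_per_node, num_nodes, node_rank,
                         master_addr, master_port, train_args):
    """Assemble the torch.distributed.run command for one machine (hypothetical helper)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in the range 0..num_nodes-1")
    launcher = [
        "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpus_per_node),  # G
        "--nnodes", str(num_nodes),              # N
        "--node_rank", str(node_rank),           # R
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "train.py",
    ]
    # shlex.join quotes arguments where needed, e.g. the empty --weights value.
    return shlex.join(launcher + list(train_args))


if __name__ == "__main__":
    args = ["--batch", "64", "--data", "coco.yaml",
            "--cfg", "yolov5s.yaml", "--weights", ""]
    # Two machines with two GPUs each: print one command per machine.
    for rank in range(2):
        print(build_launch_command(2, 2, rank, "192.168.1.1", 1234, args))
```

Only --node_rank differs between the machines; every other argument, including --master_addr, stays the same.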

4、Relevant code in train.py

Code for DP mode:

# DP mode
if cuda and RANK == -1 and torch.cuda.device_count() > 1:
    LOGGER.warning(
        'WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.\n'
        'See Multi-GPU Tutorial at https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training to get started.'
    )
    model = torch.nn.DataParallel(model)

Whether SyncBatchNorm is used:

# SyncBatchNorm
if opt.sync_bn and cuda and RANK != -1:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    LOGGER.info('Using SyncBatchNorm()')

Code for DDP mode:

# DDP mode
if cuda and RANK != -1:
    model = smart_DDP(model)

The smart_DDP function used here is defined as follows:

from torch.nn.parallel import DistributedDataParallel as DDP
def smart_DDP(model):
    # Model DDP creation with checks
    assert not check_version(torch.__version__, '1.12.0', pinned=True), \
        'torch==1.12.0 torchvision==0.13.0 DDP training is not supported due to a known issue. ' \
        'Please upgrade or downgrade torch to use DDP. See https://github.com/ultralytics/yolov5/issues/8395'
    if check_version(torch.__version__, '1.11.0'):
        return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK, static_graph=True)
    else:
        return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
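Taken together, the DP and DDP branches above pick a mode from three inputs: whether CUDA is in use, the value of RANK, and the number of visible GPUs. The sketch below restates that decision without importing torch. The names read_rank and select_mode are hypothetical, and the sketch assumes RANK comes from the RANK environment variable set by the launcher, with -1 meaning the script was started without torch.distributed.run.

```python
import os


def read_rank(env=None) -> int:
    """Read RANK from the environment; -1 means no distributed launcher (assumption)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", -1))


def select_mode(cuda: bool, rank: int, device_count: int) -> str:
    """Mirror the branch conditions of the DP and DDP snippets above."""
    if cuda and rank != -1:
        return "DDP"  # launched through torch.distributed.run
    if cuda and rank == -1 and device_count > 1:
        return "DP"  # several GPUs visible, but a plain python launch
    return "single"  # one GPU or CPU


if __name__ == "__main__":
    print(select_mode(cuda=True, rank=read_rank(), device_count=2))
```

This makes the practical consequence visible: passing --device 0,1 to a plain python train.py call falls into the DP branch, and only the torch.distributed.run launcher reaches DDP.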