The official YOLOv8 tutorial offers two ways to start training: from the command line, or by writing your own code.
Launching from the command line is convenient, and passing arguments makes it easy to adjust training parameters, but this approach makes it hard to record the parameters used or to debug the training code.
Writing your own training code is more flexible and easier to debug, but in the official example every parameter is hard-coded, which gives up that flexibility.
We can combine the strengths of both approaches: pass frequently changing parameters (batch size, epochs, imgsz, etc.) on the command line, and keep the rarely changed parameters, or those whose changes we want to record for later comparison (e.g., the various augmentation ratios), in a configuration file.
Code Analysis
First we need to know which parameters we can set. The official docs do list the arguments the command line accepts, but setting a large number of parameters each time is inconvenient, and we also don't know what the defaults are for anything we leave unset, so it is worth reading the code.
Following the model's train interface, we find that every Trainer inherits from BaseTrainer (yolo/engine/trainer.py), whose constructor looks like this:
def __init__(self, cfg=DEFAULT_CFG, overrides=None, _callbacks=None):
    """
    Initializes the BaseTrainer class.

    Args:
        cfg (str, optional): Path to a configuration file. Defaults to DEFAULT_CFG.
        overrides (dict, optional): Configuration overrides. Defaults to None.
    """
    self.args = get_cfg(cfg, overrides)
    self.device = select_device(self.args.device, self.args.batch)
    self.check_resume()
    ...
Here overrides holds the parameters we set; anything we leave unset comes from DEFAULT_CFG. Tracing further, DEFAULT_CFG turns out to be loaded from yolo/cfg/default.yaml:
# Ultralytics YOLO 🚀, AGPL-3.0 license
# Default training settings and hyperparameters for medium-augmentation COCO training
task: detect # YOLO task, i.e. detect, segment, classify, pose
mode: train # YOLO mode, i.e. train, val, predict, export, track, benchmark
# Train settings -------------------------------------------------------------------------------------------------------
model: # path to model file, i.e. yolov8n.pt, yolov8n.yaml
data: # path to data file, i.e. coco128.yaml
epochs: 100 # number of epochs to train for
start_epoch: 0 # start epoch
patience: 50 # epochs to wait for no observable improvement for early stopping of training
batch: 16 # number of images per batch (-1 for AutoBatch)
imgsz: 640 # size of input images as integer or w,h
save: True # save train checkpoints and predict results
save_period: -1 # Save checkpoint every x epochs (disabled if < 1)
cache: False # True/ram, disk or False. Use cache for data loading
device: # device to run on, i.e. cuda device=0 or device=0,1,2,3 or device=cpu
workers: 8 # number of worker threads for data loading (per RANK if DDP)
project: # project name
name: # experiment name, results saved to 'project/name' directory
exist_ok: False # whether to overwrite existing experiment
pretrained: False # whether to use a pretrained model
optimizer: SGD # optimizer to use, choices=[SGD, Adam, Adamax, AdamW, NAdam, RAdam, RMSProp, auto]
verbose: True # whether to print verbose output
seed: 0 # random seed for reproducibility
deterministic: True # whether to enable deterministic mode
single_cls: False # train multi-class data as single-class
rect: False # rectangular training if mode='train' or rectangular validation if mode='val'
cos_lr: False # use cosine learning rate scheduler
close_mosaic: 0 # (int) disable mosaic augmentation for final epochs
resume: False # resume training from last checkpoint
amp: True # Automatic Mixed Precision (AMP) training, choices=[True, False], True runs AMP check
fraction: 1.0 # dataset fraction to train on (default is 1.0, all images in train set)
profile: False # profile ONNX and TensorRT speeds during training for loggers
# Segmentation
overlap_mask: True # masks should overlap during training (segment train only)
mask_ratio: 4 # mask downsample ratio (segment train only)
# Classification
dropout: 0.0 # use dropout regularization (classify train only)
...
Everything we can set is listed in this file; setting a parameter that is not in it raises an error (the next post covers how to add new parameters).
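To see the merge behavior for yourself, you can call get_cfg directly. A minimal sketch, assuming the ultralytics 8.0.x import paths (they moved in later versions):

from ultralytics.yolo.cfg import get_cfg
from ultralytics.yolo.utils import DEFAULT_CFG

# Keys passed via overrides win; everything else falls back to default.yaml
args = get_cfg(DEFAULT_CFG, overrides={'epochs': 50, 'imgsz': 416})
print(args.epochs)  # 50 (our override)
print(args.batch)   # 16 (the default.yaml value)

# A key that does not exist in default.yaml is rejected, e.g.:
# get_cfg(DEFAULT_CFG, overrides={'not_a_real_key': 1})  # error: not a valid YOLO argument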
Custom Parameter Configuration File
We can save the parameters we tune during training in a separate yaml file, e.g. hyp.scratch.yaml as the train-from-scratch configuration. When running multiple experiments, we can then create a separate parameter file per experiment:
lr0: 0.01 # initial learning rate (i.e. SGD=1E-2, Adam=1E-3)
lrf: 0.001 # final learning rate (lr0 * lrf)
momentum: 0.937 # SGD momentum/Adam beta1
weight_decay: 0.0005 # optimizer weight decay 5e-4
warmup_epochs: 3.0 # warmup epochs (fractions ok)
warmup_momentum: 0.8 # warmup initial momentum
warmup_bias_lr: 0.1 # warmup initial bias lr
box: 7.5 # box loss gain
cls: 0.5 # cls loss gain (scale with pixels)
dfl: 1.5 # dfl loss gain
pose: 12.0 # pose loss gain
kobj: 1.0 # keypoint obj loss gain
label_smoothing: 0.0 # label smoothing (fraction)
nbs: 64 # nominal batch size
hsv_h: 0.015 # image HSV-Hue augmentation (fraction)
hsv_s: 0.7 # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4 # image HSV-Value augmentation (fraction)
degrees: 0.0 # image rotation (+/- deg)
translate: 0.1 # image translation (+/- fraction)
scale: 0.5 # image scale (+/- gain)
shear: 0.0 # image shear (+/- deg)
perspective: 0.0 # image perspective (+/- fraction), range 0-0.001
flipud: 0.0 # image flip up-down (probability)
fliplr: 0.5 # image flip left-right (probability)
mosaic: 0.1 # image mosaic (probability)
mixup: 0.05 # image mixup (probability)
copy_paste: 0.0 # segment copy-paste (probability)
workers: 12 # number of workers
# cache: disk
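Since any key absent from default.yaml is rejected at train time, it can be handy to sanity-check a parameter file up front. A minimal sketch, assuming DEFAULT_CFG_DICT is importable from ultralytics.yolo.utils (true for 8.0.x; the path may differ in other versions):

import yaml
from ultralytics.yolo.utils import DEFAULT_CFG_DICT

with open('configs/hyp.yaml') as f:
    hyp = yaml.safe_load(f)

# get_cfg would reject any key not present in default.yaml, so catch typos early
unknown = set(hyp) - set(DEFAULT_CFG_DICT)
assert not unknown, f'invalid keys in hyp file: {unknown}'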
Custom Training Script
With the parameter file in place, we still need our own training script to load it, plus a handful of frequently changing parameters passed on the command line. Create train.py:
from ultralytics import YOLO
import yaml
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, default='configs/data/phd.yaml', help='dataset.yaml path')
parser.add_argument('--epochs', type=int, default=300, help='number of epochs')
parser.add_argument('--hyp', type=str, default='configs/hyp.yaml', help='hyperparameters yaml path')
parser.add_argument('--model', type=str, default='weights/yolov8n.pt', help='pretrained weights or model.config path')
parser.add_argument('--batch-size', type=int, default=64, help='size of each image batch')
parser.add_argument('--img-size', type=int, default=320, help='size of each image dimension')
parser.add_argument('--device', type=str, default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument('--project', type=str, default='yolo', help='project name')
parser.add_argument('--name', type=str, default='pretrain', help='exp name')
parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training')
args = parser.parse_args()
assert args.data, 'argument --data path is required'
assert args.model, 'argument --model path is required'
if __name__ == '__main__':
    # Initialize
    model = YOLO(args.model)
    with open(args.hyp) as f:
        hyperparams = yaml.safe_load(f)
    # Command-line values override the ones from the hyp file
    hyperparams['epochs'] = args.epochs
    hyperparams['batch'] = args.batch_size
    hyperparams['imgsz'] = args.img_size
    hyperparams['device'] = args.device
    hyperparams['project'] = args.project
    hyperparams['name'] = args.name
    hyperparams['resume'] = args.resume
    model.train(data=args.data, **hyperparams)
The script accepts command-line arguments via argparse and writes them into the hyperparameter dict, much like yolov5's launch script.
The main arguments are (add or remove to suit your needs):
- data: the dataset configuration file
- hyp: the parameter file (the one we created in the previous section)
- model: model weights, or a model-structure configuration file
The remaining arguments are self-explanatory from their names.
Training (Single GPU)
python train.py --model weights/yolov8n.pt --data configs/data/objects365.yaml \
    --hyp configs/hyp.yaml --batch-size 512 --img-size 416 \
    --device 0 --project object365 --name yolov8n
Training (Multi-GPU DDP)
In theory, setting device to multiple GPUs should be all it takes to train in parallel, but running the script this way immediately fails with the following error:
assert args.model, 'argument --model path is required'
In other words, the arguments we set were never received. Digging deeper, under DDP the command that actually runs is:
DDP command: ['/root/miniconda3/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '4', '--master_port', '39083', 'xxx/code/yolov8/train.py']
So YOLOv8 actually re-launches our train.py with PyTorch's DDP launcher, but without forwarding the arguments we passed (arguably a bug). This happens in BaseTrainer's train interface:
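The relevant logic, abridged from ultralytics 8.0.x (details vary by version), looks roughly like this:

# Inside BaseTrainer.train (abridged): with more than one GPU requested, and when we are
# not already inside a DDP worker, YOLOv8 re-launches the current script as a subprocess
if world_size > 1 and 'RANK' not in os.environ:
    cmd, file = generate_ddp_command(world_size, self)  # builds the torch.distributed.run command
    try:
        subprocess.run(cmd, check=True)
    finally:
        ddp_cleanup(self, str(file))
else:
    self._do_train(world_size)

Note that generate_ddp_command builds the command from scratch, which is why our argparse arguments are dropped.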
We fix this by modifying generate_ddp_command to append our command-line arguments after train.py (add *sys.argv[1:] after file):
cmd = [sys.executable, '-m', dist_cmd, '--nproc_per_node', f'{world_size}', '--master_port', f'{port}', file, *sys.argv[1:]]
The full function:
def generate_ddp_command(world_size, trainer):
    """Generates and returns command for distributed training."""
    import __main__  # noqa local import to avoid https://github.com/Lightning-AI/lightning/issues/15218
    if not trainer.resume:
        shutil.rmtree(trainer.save_dir)  # remove the save_dir
    file = str(Path(sys.argv[0]).resolve())
    safe_pattern = re.compile(r'^[a-zA-Z0-9_. /\\-]{1,128}$')  # allowed characters and maximum of 128 characters
    if not (safe_pattern.match(file) and Path(file).exists() and file.endswith('.py')):  # using CLI
        file = generate_ddp_file(trainer)
    dist_cmd = 'torch.distributed.run' if TORCH_1_9 else 'torch.distributed.launch'
    port = find_free_network_port()
    cmd = [sys.executable, '-m', dist_cmd, '--nproc_per_node', f'{world_size}', '--master_port', f'{port}', file, *sys.argv[1:]]  # forward our CLI args
    return cmd, file
With this change, setting device to multiple GPUs starts training normally.
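For example, the single-GPU command above becomes a 4-GPU DDP run simply by listing more device ids, with no other changes:

python train.py --model weights/yolov8n.pt --data configs/data/objects365.yaml \
    --hyp configs/hyp.yaml --batch-size 512 --img-size 416 \
    --device 0,1,2,3 --project object365 --name yolov8n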
Conclusion
This post showed how to launch YOLOv8 training from a custom training script, combining the advantages of the command line and configuration files: training parameters remain easy to tweak on the fly, while the hyperparameters are managed in a config file. With a small source modification, DDP training works as well.