Fixing wandb config errors when training multiple models in the same Python script

news2025/4/11 14:38:49

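Contents

- Abstract
- Background
- Error message
- wandb run name

Abstract

When a single Python script trains several models in a machine learning project, you can run into a WandB (Weights and Biases) config error, particularly when the models have different parameter counts.
This post explains how to resolve that error, which is reported as: Attempted to change value of key "model/num_parameters" from xxxx to xxxx. The fix is to call wandb.finish() to close the previous model's run so that the next model's training can proceed normally. This avoids the interruption caused by the changed parameter count and keeps each model's training record separate and easier to manage.

The post also shows how to customize the wandb run name: setting transformers.TrainingArguments.run_name replaces the default name (the output path), which makes it easier to manage and tell apart the training records of different models.

Background

I trained two models with different parameter counts in the same Python script file, which produced the error below.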

Error message

/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 475/475 [00:12<00:00, 43.48it/s]wandb: ERROR Attempted to change value of key "model/num_parameters" from 102269186 to 102270724
wandb: ERROR If you really want to do this, pass allow_val_change=True to config.update()
Traceback (most recent call last):
  File "/home/jie/github/text_classification/BERT/demo/test.py", line 193, in <module>
    pred = bert_multi_cls.pred(ds["test"])
  File "/home/jie/github/text_classification/BERT/demo/test.py", line 147, in pred
    predictions = self.trainer.predict(pred_dataset)
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3761, in predict
    self.control = self.callback_handler.on_predict(self.args, self.state, self.control, output.metrics)
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer_callback.py", line 492, in on_predict
    return self.call_event("on_predict", args, state, control, metrics=metrics)
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer_callback.py", line 507, in call_event
    result = getattr(callback, event)(
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 989, in on_predict
    self.setup(args, state, **kwargs)
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 853, in setup
    self._wandb.config["model/num_parameters"] = model.num_parameters()
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/wandb/sdk/wandb_config.py", line 149, in __setitem__
    key, val = self._sanitize(key, val)
  File "/home/jie/anaconda3/envs/llm/lib/python3.10/site-packages/wandb/sdk/wandb_config.py", line 285, in _sanitize
    raise config_util.ConfigError(
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "model/num_parameters" from 102269186 to 102270724
If you really want to do this, pass allow_val_change=True to config.update()

The full output is shown above; the key part is this error:

wandb.sdk.lib.config_util.ConfigError: Attempted to change value of 
key "model/num_parameters" from 102269186 to 102270724

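The value of model/num_parameters changed from 102269186 to 102270724. wandb does not allow a config value that is already set to be changed, so it raises an error. Because the first model's wandb run was still active, the Trainer for the second model wrote its parameter count into the same run's config.
But I am training multiple models in the same Python file, and their sizes really are different.

The fix suggested by the error message is to pass allow_val_change=True to wandb's config.update. Since training goes through the transformers Trainer, however, there is no easy way to pass allow_val_change through the Trainer to wandb's config.

The workaround in this case is: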

import wandb
wandb.finish()

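Once the previous model's Trainer has finished training, call wandb.finish() to end the current wandb run.

When the next model is trained, the transformers Trainer automatically creates a new wandb run, so logging is not affected in any way.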
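The pattern above can be sketched as a small loop that closes the wandb run after each model. This is only a minimal sketch: run_sequentially, train_one, and finish_run are hypothetical names that do not appear in the original script. wandb is imported lazily, so the helper can also be exercised with a stand-in callback. Because the traceback shows the error being raised from trainer.predict, it is probably safest to call wandb.finish() only after all work for a model (train, evaluate, predict) is done.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def _finish_wandb_run() -> None:
    """Close the active wandb run (wandb is imported only when this is called)."""
    import wandb

    wandb.finish()


def run_sequentially(
    jobs: Iterable[T],
    train_one: Callable[[T], R],
    finish_run: Callable[[], None] = _finish_wandb_run,
) -> List[R]:
    """Run train_one for each job and close the logging run after each one.

    train_one is a hypothetical user-supplied function that builds the model
    and Trainer, then trains and predicts, for a single job.
    """
    results: List[R] = []
    for job in jobs:
        try:
            results.append(train_one(job))
        finally:
            # Always close the run, even if training failed, so the next
            # model starts with a fresh run and a fresh config.
            finish_run()
    return results
```

With a real Trainer, train_one would typically build the model and the Trainer, call trainer.train() and trainer.predict(...), and return the predictions or metrics. The default finish_run then calls wandb.finish() before the next model starts.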

wandb run name

In the logs on the wandb website, a run's name defaults to transformers.TrainingArguments.output_dir. In other words, the default wandb run name is the output path.

If you want the name shown in wandb to differ from output_dir, set transformers.TrainingArguments.run_name.
The wandb run is then named after run_name.

run_name (`str`, *optional*, defaults to `output_dir`):
    A descriptor for the run. Typically used for [wandb](https://www.wandb.com/),
    [mlflow](https://www.mlflow.org/) and [comet](https://www.comet.com/site) logging. If not specified, will
    be the same as `output_dir`.

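The text above is from the TrainingArguments source code (its docstring), which says that run_name is used by wandb.
run_name defaults to None; if it is not specified, it takes the same value as output_dir.

As the following code shows, specify run_name when initializing TrainingArguments.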

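from transformers import TrainingArguments

# output_dir, epoch, and run_name are defined elsewhere in the script.
args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",          # evaluate at the end of every epoch
    save_strategy="epoch",          # checkpoint on the same schedule
    save_total_limit=3,
    learning_rate=2e-5,
    num_train_epochs=epoch,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    # logging_steps=16,
    save_safetensors=True,
    overwrite_output_dir=True,
    load_best_model_at_end=True,
    run_name=run_name,              # name shown in wandb instead of output_dir
)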
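When several models are trained in one script, it can help to derive a distinct run_name for each of them. The helper below is only an illustrative sketch: make_run_name, the model_tag argument, and the example path are assumptions, not part of the original script or of transformers.

```python
from pathlib import Path


def make_run_name(output_dir: str, model_tag: str) -> str:
    """Build a short run name from the output directory and a model tag."""
    # Keep only the last path component so the name stays short in the wandb UI.
    base = Path(output_dir).name
    return f"{base}-{model_tag}"


# Hypothetical values, for illustration only.
output_dir = "outputs/bert_cls"
run_name = make_run_name(output_dir, "multi")
```

The resulting string can be passed as run_name=run_name to TrainingArguments, as in the code above.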