Follow my CSDN blog: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/133640212
Error log:
# ...
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1121, in _teardown
    self.strategy.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/horovod.py", line 241, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/parallel.py", line 114, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 499, in teardown
    self.accelerator.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/accelerators/cuda.py", line 76, in teardown
    torch.cuda.empty_cache()
  File "lib/python3.7/site-packages/torch/cuda/memory.py", line 125, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
# ...
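The log itself suggests setting CUDA_LAUNCH_BLOCKING=1. With it, CUDA kernel launches become synchronous, so the traceback points at the kernel that actually faulted rather than at a later, unrelated call such as torch.cuda.empty_cache() during teardown. A minimal sketch, assuming the variable is set at the top of the training script before CUDA is initialized (setting it before `import torch` is the safest choice); `enable_sync_cuda_debug` is a hypothetical helper name:

```python
import os


def enable_sync_cuda_debug() -> None:
    """Make CUDA kernel launches synchronous for debugging (slows training down)."""
    # Must take effect before the CUDA context is created, so call this
    # at the very top of the script, before importing torch.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_debug()
# import torch  # import torch (and pytorch_lightning) only after this point
```

Remove the setting once the faulting kernel has been located, since synchronous launches noticeably slow training.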
Core error: CUDA error: an illegal memory access was encountered, i.e., the GPU tried to access memory it was not allowed to.
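Cause: in this case, GPU memory overflow. The fix is to lower the configuration parameters that drive GPU memory usage, for example the size of the input features.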
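Why shrinking the input helps can be seen with a back-of-the-envelope estimate: a dense tensor occupies numel × bytes_per_element bytes, so every activation whose shape contains the feature length shrinks linearly with it, and anything quadratic in it (e.g. a pairwise or attention map) shrinks quadratically. The sketch below is purely illustrative; the shapes and the helper `tensor_mib` are hypothetical and not taken from the original configuration:

```python
def tensor_mib(shape, bytes_per_elem=4):
    """Memory of a dense tensor in MiB (default: float32, 4 bytes per element)."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * bytes_per_elem / 2**20


batch, length, channels = 8, 512, 128
# Linear in the feature length: shape (batch, length, channels)
linear_full = tensor_mib((batch, length, channels))
linear_half = tensor_mib((batch, length // 2, channels))
# Quadratic in the feature length: shape (batch, length, length, channels)
pair_full = tensor_mib((batch, length, length, channels))
pair_half = tensor_mib((batch, length // 2, length // 2, channels))

print(f"linear:   {linear_full:.0f} MiB -> {linear_half:.0f} MiB")  # 2 MiB -> 1 MiB
print(f"pairwise: {pair_full:.0f} MiB -> {pair_half:.0f} MiB")      # 1024 MiB -> 256 MiB
```

Halving the feature length halves the linear tensor but cuts the pairwise one to a quarter, which is why the input size is usually the most effective knob.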
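The problem can also be caught early by watching GPU memory usage in W&B (Weights & Biases). For example, GPU memory pinned at 100% is likely to overflow.
A normal run, by contrast, stays at about 83%.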
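Besides watching the W&B dashboard, the same ratio can be checked from inside the training process. A minimal sketch, assuming PyTorch is installed: torch.cuda.mem_get_info() returns the free and total memory of the current device in bytes. The 95% threshold and the helper names are hypothetical choices. This measures all memory on the device (including PyTorch's cache and other processes), similar to what a system-level monitor reports:

```python
def gpu_memory_percent(free_bytes, total_bytes):
    """Percentage of device memory in use, given free and total bytes."""
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def check_gpu_memory(threshold=95.0):
    """Warn when the current GPU is nearly full; return usage in %, or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed
    if not torch.cuda.is_available():
        return None  # no GPU to inspect
    free, total = torch.cuda.mem_get_info()
    pct = gpu_memory_percent(free, total)
    if pct >= threshold:
        print(f"warning: GPU memory at {pct:.1f}%, consider a smaller input size")
    return pct


print(gpu_memory_percent(free_bytes=17, total_bytes=100))  # 83.0
```

The returned percentage could also be logged next to the training metrics, e.g. with wandb.log({"gpu_mem_pct": pct}), where the key name is arbitrary.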