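CUDA unavailable after system suspend
Problem description
After the computer woke up from sleep, CUDA was no longer available. PyTorch printed the following warning: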
/home/<your-computer>/pytorch/lib/python3.8/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
First attempt: put the CUDA directories back on PATH and LD_LIBRARY_PATH
export PATH=/usr/local/cuda-11/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:$LD_LIBRARY_PATH
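- This did not help, so the problem was not caused by missing environment variables
According to reference [1], the problem may be related to the computer having been suspended, which led to [2].
This error is reported when the system cannot communicate with the GPU. Possible causes:
- Cause 1: the driver was updated without a reboot, or there is some other installation problem
- Cause 2: the computer entered the suspend state; after a reboot CUDA works again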
Solution
Reload the nvidia_uvm kernel module:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
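The two commands above can also be wrapped in a small script. The following is only a minimal sketch, not part of the original fix: the helper names `module_use_count`, `reload_commands` and `reload_module` are hypothetical. It assumes a Linux system where `/proc/modules` lists loaded modules with the use count in the third column. Since `rmmod` refuses to unload a module that is still in use, the sketch checks the use count first, and by default it only prints the commands instead of running them.

```python
import subprocess
from pathlib import Path

MODULE = "nvidia_uvm"  # module name taken from the commands above


def module_use_count(proc_modules_text, module=MODULE):
    """Return the module's use count from /proc/modules text, or None if it is not loaded."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Assumed column layout: name, size, use count, users, state, address
        if len(fields) >= 3 and fields[0] == module:
            # Kernels without unload support print "-" here; treat that as 0.
            return int(fields[2]) if fields[2].isdigit() else 0
    return None


def reload_commands(module=MODULE):
    """The same two commands as in the text, as argument lists."""
    return [["sudo", "rmmod", module], ["sudo", "modprobe", module]]


def reload_module(execute=False, module=MODULE):
    """Print (and optionally run) the reload commands. Hypothetical helper."""
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    count = module_use_count(text, module)
    commands = reload_commands(module)
    if count is None:
        print(f"{module} is not loaded; only modprobe is needed")
        commands = commands[1:]
    elif count > 0:
        print(f"{module} is in use (count={count}); close GPU programs first")
        return False
    for cmd in commands:
        print("$", " ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)  # will prompt for the sudo password
    return True


if __name__ == "__main__":
    reload_module(execute=False)  # dry run: only prints the commands
```

Calling `reload_module(execute=True)` would actually run the commands; the dry-run default is a deliberate choice because unloading a kernel module affects every process using the GPU.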
- Quick check that CUDA is available again
import torch
torch.cuda.is_available()
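The same check can be run from a script. The sketch below starts a fresh Python interpreter for it, on the assumption that a process in which CUDA initialization has already failed may keep reporting the failure even after the module is reloaded. The helper names `parse_check_output` and `cuda_available_in_fresh_process` are hypothetical, and the function returns `None` when the check cannot run at all (for example when torch is not installed).

```python
import subprocess
import sys

# The same check as above, executed by a new interpreter.
CHECK_SNIPPET = "import torch; print(torch.cuda.is_available())"


def parse_check_output(stdout):
    """Map the last printed line ('True' or 'False') to a bool, anything else to None."""
    lines = stdout.strip().splitlines()
    last = lines[-1].strip() if lines else ""
    return {"True": True, "False": False}.get(last)


def cuda_available_in_fresh_process(python=sys.executable, timeout=120):
    """Return True/False as reported by torch, or None if the check could not run."""
    try:
        result = subprocess.run(
            [python, "-c", CHECK_SNIPPET],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # interpreter not found, or the check hung
    if result.returncode != 0:
        return None  # e.g. torch is not installed in that interpreter
    return parse_check_output(result.stdout)


if __name__ == "__main__":
    print("CUDA available:", cuda_available_in_fresh_process())
```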
For an introduction to rmmod and modprobe, see [3].
References
[1] https://blog.csdn.net/weixin_48319333/article/details/128214617
[2] https://discuss.pytorch.org/t/userwarning-cuda-initialization-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set-up-environment-e-g-changing-env-variable-cuda-visible-devices-after-program-start-setting-the-available-devices-to-be-zero/129335/2
[3] https://blog.csdn.net/Ternence_zq/article/details/131068125