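Installing a PaddlePaddle environment as a regular user on a multi-GPU Ubuntu server
- 1. Create a conda virtual environment
- 2. Install the GPU version of paddlepaddle
- 2.1 Choose a CUDA version
- 2.2 Install paddle
- 3. Verification and troubleshooting
- 3.1 How to verify
- 3.2 First error: CUDA problem
- 3.3 Second error: NCCL problem (multi-GPU)
- 4. Set the environment variable permanently, so the dependency directory need not be set each time
Installing paddle on my local Ubuntu machine had gone smoothly, but on a multi-GPU server I ran into quite a few problems. The main causes: the server already had CUDA and related components installed, a regular user has no permission to modify system-level dependencies, and a multi-GPU setup differs somewhat from a single-GPU one.
The main reference is the installation guide in the official paddle documentation.
1. Create a conda virtual environment
- Create the virtual environment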
conda create -n paddle_env python=3.9
- Activate the virtual environment
conda activate paddle_env
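2. Install the GPU version of paddlepaddle
One point to stress: be sure to install with conda. conda can install its own independent copy of CUDA and related dependencies inside the current environment, so nothing conflicts with the dependencies preinstalled on the system. Installing with pip is considerably more trouble here.
2.1 Choose a CUDA version
Run nvidia-smi in a terminal and check the CUDA Version it reports (the highest CUDA version the installed driver supports), then choose a CUDA version lower than that. For example, my machine reports CUDA Version 11.5, so I chose to install cuda 11.2.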
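The selection rule above can be sketched in a few lines of Python. This is only an illustration: the header line and the candidate list are made-up sample data, the function names are hypothetical, and allowing a toolkit version equal to the driver's version (rather than strictly lower) is an assumption.

```python
import re
from typing import List, Tuple


def parse_driver_cuda_version(smi_output: str) -> Tuple[int, int]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '11.2' into (11, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_toolkit(driver_version: Tuple[int, int], candidates: List[str]) -> str:
    """Return the newest candidate that does not exceed the driver's version."""
    usable = [c for c in candidates if as_tuple(c) <= driver_version]
    if not usable:
        raise ValueError("no candidate toolkit is supported by this driver")
    return max(usable, key=as_tuple)


# Hypothetical sample data, for illustration only.
sample_header = "| NVIDIA-SMI 495.29.05   Driver Version: 495.29.05   CUDA Version: 11.5 |"
candidates = ["10.2", "11.2", "11.6", "11.7"]

chosen = pick_toolkit(parse_driver_cuda_version(sample_header), candidates)
print(chosen)  # 11.2
```

With a reported version of 11.5, the sketch lands on 11.2, the same choice made in this post. Which toolkit versions are actually available depends on the paddle build, so check the official installation page rather than the sample list.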
2.2 Install paddle
conda install paddlepaddle-gpu==2.4.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
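3. Verification and troubleshooting
3.1 How to verify
After installation, start the Python interpreter with python3, enter import paddle, and then enter paddle.utils.run_check().
If PaddlePaddle is installed successfully! appears, the installation succeeded.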
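The same check can be run non-interactively. The sketch below is a hedged wrapper, not part of paddle: `paddle_install_status` and its three return strings are hypothetical names. It only distinguishes "paddle is not importable", "the check raised an exception", and "the check ran"; the messages printed by `paddle.utils.run_check()` still have to be read to see whether the GPU setup is complete.

```python
import importlib.util


def paddle_install_status() -> str:
    """Return 'missing', 'failed', or 'checked' for the active interpreter."""
    if importlib.util.find_spec("paddle") is None:
        return "missing"  # paddle cannot be imported from this environment
    try:
        import paddle

        # run_check() prints its own diagnostics, including the warnings
        # shown in the sections on CUDA and NCCL errors.
        paddle.utils.run_check()
    except Exception as exc:  # e.g. a dynamic library failed to load
        print(f"run_check raised: {exc}")
        return "failed"
    return "checked"


status = paddle_install_status()
print(status)
```

Run it with the interpreter of the activated paddle_env environment; a result of "missing" usually means a different Python is being used.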
3.2 First error: CUDA problem
W0505 03:08:12.283917 3969672 dynamic_loader.cc:307] The third-party dynamic library (libcudnn.so) that Paddle depends on is not configured correctly. (error code is /usr/local/cuda/lib64/libcudnn.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX;
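- Solution
Looking under the environment's install path, the CUDA-related dependencies are in fact already present.
Paddle is still searching the system directory, though (the log shows /usr/local/cuda/lib64), so pointing the loader at the environment's lib directory is enough. Run in the terminal:
export LD_LIBRARY_PATH=[install path]/miniconda3/envs/paddle_env/lib
Verify again, and the previous error is gone.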
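To confirm that the libraries really are in the environment before changing LD_LIBRARY_PATH, a small check like the following can help. It is a sketch under assumptions: `find_missing_libs` is a hypothetical helper, the two library names are the ones that appear in the error logs in this post, and the demo runs against a temporary directory rather than a real conda environment.

```python
import tempfile
from pathlib import Path


def find_missing_libs(lib_dir, names=("libcudnn.so", "libnccl.so")):
    """Return the names that have no matching file (any version suffix) in lib_dir."""
    lib_dir = Path(lib_dir)
    return [name for name in names if not any(lib_dir.glob(name + "*"))]


# Demo on a throwaway directory that contains only a cuDNN library.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "libcudnn.so.8").touch()
    missing = find_missing_libs(tmp)

print(missing)  # ['libnccl.so']
```

For a real check, pass the environment's lib directory, i.e. the same path used in the export command. When the script is run with the environment's own Python, `Path(sys.prefix) / "lib"` normally points there.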
3.3 Second error: NCCL problem (multi-GPU)
W0505 03:22:18.677640 3977430 dynamic_loader.cc:278] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.
[2023-05-05 03:22:18,678] [ WARNING] install_check.py:281 - PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
1. There is not enough GPUs visible on your system
2. Some GPUs are occupied by other process now
3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2023-05-05 03:22:18,679] [ WARNING] install_check.py:289 -
Original Error is: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:305)
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
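- Solution
Download and install NCCL. It has to be downloaded from the NVIDIA website, from the NCCL download page linked in the warning above.
After downloading, extract the archive: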
tar xvf nccl_2.17.1-1+cuda11.0_x86_64.txz
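After extraction, the libraries can be copied straight into the environment's install directory (the lib directory that LD_LIBRARY_PATH points to above).
Verify once more, and the check now passes!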
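The copy step can be sketched as follows. This assumes the extracted archive keeps its shared libraries in a lib/ subdirectory, with libnccl.so and libnccl.so.2 as symlinks to the versioned file; `copy_nccl_libs` is a hypothetical helper, and the demo builds a fake archive layout in a temporary directory instead of touching a real environment.

```python
import shutil
import tempfile
from pathlib import Path


def copy_nccl_libs(extracted_dir, env_lib_dir):
    """Copy lib/libnccl* from the unpacked archive into env_lib_dir, keeping symlinks."""
    src = Path(extracted_dir) / "lib"
    dst = Path(env_lib_dir)
    copied = []
    for item in sorted(src.glob("libnccl*")):
        target = dst / item.name
        if target.exists() or target.is_symlink():
            continue  # leave anything already present untouched
        # follow_symlinks=False recreates a symlink as a symlink
        shutil.copy2(item, target, follow_symlinks=False)
        copied.append(item.name)
    return copied


# Demo with a fake archive layout (assumed, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    archive_lib = Path(tmp) / "nccl" / "lib"
    env_lib = Path(tmp) / "env" / "lib"
    archive_lib.mkdir(parents=True)
    env_lib.mkdir(parents=True)
    (archive_lib / "libnccl.so.2.17.1").write_bytes(b"fake library")
    (archive_lib / "libnccl.so.2").symlink_to("libnccl.so.2.17.1")
    (archive_lib / "libnccl.so").symlink_to("libnccl.so.2")

    copied = copy_nccl_libs(Path(tmp) / "nccl", env_lib)
    link_kept = (env_lib / "libnccl.so").is_symlink()
    resolved_name = (env_lib / "libnccl.so").resolve().name

print(copied)
print(link_kept, resolved_name)
```

Keeping the symlinks avoids storing the same library several times, and the glob also picks up any static library in the same directory, which is harmless.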
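4. Set the environment variable permanently, so the dependency directory need not be set each time
- The environment variable has to be set whenever you work in the paddle environment:
export LD_LIBRARY_PATH=[install path]/miniconda3/envs/paddle_env/lib
- It can also be set automatically every time a terminal is opened: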
vim ~/.bashrc
Add this at the very bottom:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[install path]/miniconda3/envs/paddle_env/lib
Save and exit; the setting takes effect the next time a terminal is opened.
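One detail of the appended form: if LD_LIBRARY_PATH was empty beforehand, the result starts with a colon, and the dynamic loader treats an empty entry as the current directory. Re-sourcing ~/.bashrc also repeats the entry. The sketch below shows, in Python, the string one would ideally end up with; `append_lib_dir` and the example path are hypothetical.

```python
def append_lib_dir(current: str, lib_dir: str) -> str:
    """Return `current` with lib_dir appended, dropping empty entries and duplicates."""
    parts = [p for p in current.split(":") if p]
    if lib_dir not in parts:
        parts.append(lib_dir)
    return ":".join(parts)


env_lib = "/home/user/miniconda3/envs/paddle_env/lib"  # hypothetical path

print(append_lib_dir("", env_lib))
# /home/user/miniconda3/envs/paddle_env/lib
print(append_lib_dir("/usr/local/cuda/lib64", env_lib))
# /usr/local/cuda/lib64:/home/user/miniconda3/envs/paddle_env/lib
print(append_lib_dir(env_lib, env_lib))
# /home/user/miniconda3/envs/paddle_env/lib
```

Note that directories listed earlier in the variable are searched first, so if a system CUDA directory is already in LD_LIBRARY_PATH, its libraries take priority over the environment's.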