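Contents
- A Love-Hate History with Paddle
- PaddleCloud
- Multi-GPU
A Love-Hate History with Paddle
Paddle is a deep learning framework developed in China by Baidu. PaddlePaddle underpins a series of open-source domain toolkits such as PaddleOCR and PaddleNLP, and has contributed a great deal to putting deep learning into practice in China.
However, installing PaddlePaddle has always given me trouble: C++ errors in one place, multi-GPU not working in another, and a different set of errors on every Linux environment. All these limitations gradually pushed me away from it. How can installing Paddle be made as smooth as torch, working out of the box instead of sinking into one error after another? After a lot of trial and error, I gradually found a direction.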
PaddleCloud
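First, the link: https://hub.docker.com/r/paddlecloud/paddlenlp
One day, while looking something up in the PaddleNLP documentation, I saw that PaddleCloud had open-sourced Paddle-based images that work out of the box.
PaddleCloud mainly hosts the standard images for PaddleNLP, the PaddlePaddle model suite, so that users of the suite can deploy it with Docker or in the cloud.
I tried it right away and pulled the image onto a Linux server: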
docker pull paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest
Next, create the container:
docker run -itd --name container_name -v /path:/path paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest /bin/bash
Enter the container:
docker exec -it container_name /bin/bash
Check that the PaddlePaddle framework works:
python
>>> import paddle
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0130 06:01:35.244894 23 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.2
W0130 06:01:35.276093 23 gpu_context.cc:306] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
W0130 06:01:44.027418 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0130 06:01:44.027439 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
W0130 06:01:44.027443 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 3
W0130 06:01:44.027446 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 4
W0130 06:01:44.027449 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 5
W0130 06:01:44.027452 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 6
W0130 06:01:44.027456 23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 7
W0130 06:01:44.027458 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
W0130 06:01:44.027462 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
W0130 06:01:44.027464 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 3
W0130 06:01:44.027467 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 4
W0130 06:01:44.027469 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 5
W0130 06:01:44.027472 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 6
W0130 06:01:44.027477 23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 7
W0130 06:01:44.027480 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
W0130 06:01:44.027523 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
W0130 06:01:44.027529 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 3
W0130 06:01:44.027530 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 4
W0130 06:01:44.027534 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 5
W0130 06:01:44.027536 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 6
W0130 06:01:44.027541 23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 7
W0130 06:01:44.027544 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 0
W0130 06:01:44.027549 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 1
W0130 06:01:44.027554 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 2
W0130 06:01:44.027556 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 4
W0130 06:01:44.027559 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 5
W0130 06:01:44.027611 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 6
W0130 06:01:44.027614 23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 7
W0130 06:01:44.027617 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 0
W0130 06:01:44.027621 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 1
W0130 06:01:44.027624 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 2
W0130 06:01:44.027627 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 3
W0130 06:01:44.027629 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 5
W0130 06:01:44.027632 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 6
W0130 06:01:44.027635 23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 7
W0130 06:01:44.027638 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 0
W0130 06:01:44.027640 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 1
W0130 06:01:44.027643 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 2
W0130 06:01:44.027647 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 3
W0130 06:01:44.027649 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 4
W0130 06:01:44.027652 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 6
W0130 06:01:44.027655 23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 7
W0130 06:01:44.027696 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 0
W0130 06:01:44.027699 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 1
W0130 06:01:44.027704 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 2
W0130 06:01:44.027707 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 3
W0130 06:01:44.027712 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 4
W0130 06:01:44.027717 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 5
W0130 06:01:44.027720 23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 7
W0130 06:01:44.027724 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 0
W0130 06:01:44.027727 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 1
W0130 06:01:44.027730 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 2
W0130 06:01:44.027736 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 3
W0130 06:01:44.027740 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 4
W0130 06:01:44.027752 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 5
W0130 06:01:44.027757 23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 6
WARNING:root:PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
1. There is not enough GPUs visible on your system
2. Some GPUs are occupied by other process now
3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
WARNING:root:
Original Error is: (External) NCCL error(2), unhandled system error.
[Hint: 'ncclSystemError'. A call to the system failed.] (at /paddle/paddle/fluid/platform/device/gpu/nccl_helper.h:155)
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
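The output above shows that the installation succeeded, but only for a single GPU. Multi-GPU did not work, but I made do with it for the time being.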
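If you run this check from a script rather than reading it by eye, the status lines at the end are enough to tell the two outcomes apart. Below is a minimal sketch, assuming the message wording matches the log above (it may differ in other Paddle versions). The name `summarize_run_check` is made up for illustration and is not a Paddle API.

```python
import re


def summarize_run_check(output: str) -> dict:
    """Summarize the text printed by paddle.utils.run_check()."""
    # Each successful stage prints "PaddlePaddle works well on N GPU(s)."
    working = [int(n) for n in re.findall(r"works well on (\d+) GPUs?\.", output)]
    return {
        "max_working_gpus": max(working, default=0),
        "single_gpu_only": "ONLY for single GPU" in output,
        "nccl_error": "NCCL error" in output,
    }


# A few lines copied from the failing run shown above
sample = """PaddlePaddle works well on 1 GPU.
Original Error is: (External) NCCL error(2), unhandled system error.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
"""
print(summarize_run_check(sample))
```

The function only looks at the text, so it works the same on a saved log file as on captured console output.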
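Multi-GPU
These days deep learning training usually starts at 2 GPUs, so PaddlePaddle not being able to use multiple GPUs kept bothering me. After some searching, I found that the problem was with NCCL. I went through quite a lot of material on how to fix it and eventually tracked down the cause.
Link to the solution: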
https://github.com/pytorch/pytorch/issues/73775
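So I deleted the container I had created earlier and created it again, this time with one extra mount, -v /dev/shm/:/dev/shm, so that the container uses the host's shared memory. The likely reason this helps: NCCL can pass data between GPUs through shared memory when P2P access is unavailable, and a Docker container gets only 64 MB of /dev/shm by default.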
docker run -itd --name container_name -v /path:/path -v /dev/shm/:/dev/shm paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest /bin/bash
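Before rerunning the check, it can help to confirm how much shared memory the new container actually sees. This is a minimal sketch, meant to be run inside the container. `shm_total_mib` is a hypothetical helper name, and using 64 MiB (Docker's default /dev/shm size) as the comparison point is my own assumption, not something Paddle or NCCL documents.

```python
import os
import shutil

DOCKER_DEFAULT_SHM_MIB = 64  # Docker's default /dev/shm size for a container


def shm_total_mib(path="/dev/shm"):
    """Return the total size in MiB of the filesystem holding `path`, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / (1024 * 1024)


size = shm_total_mib()
if size is None:
    print("/dev/shm not found")
elif size <= DOCKER_DEFAULT_SHM_MIB:
    print(f"/dev/shm is {size:.0f} MiB: no larger than Docker's default")
else:
    print(f"/dev/shm is {size:.0f} MiB: larger than Docker's default")
```

If the reported size is still 64 MiB, the extra mount probably did not take effect, for example because the old container was reused instead of recreated.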
After entering the container, check that Paddle works:
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0130 06:10:52.232132 22 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.2
W0130 06:10:52.234642 22 gpu_context.cc:306] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
W0130 06:10:54.919947 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0130 06:10:54.919976 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
W0130 06:10:54.919981 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 3
W0130 06:10:54.919983 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 4
W0130 06:10:54.919986 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 5
W0130 06:10:54.919989 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 6
W0130 06:10:54.919992 22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 7
W0130 06:10:54.919996 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
W0130 06:10:54.919998 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
W0130 06:10:54.920001 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 3
W0130 06:10:54.920003 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 4
W0130 06:10:54.920009 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 5
W0130 06:10:54.920012 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 6
W0130 06:10:54.920019 22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 7
W0130 06:10:54.920022 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
W0130 06:10:54.920027 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
W0130 06:10:54.920029 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 3
W0130 06:10:54.920037 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 4
W0130 06:10:54.920039 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 5
W0130 06:10:54.920044 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 6
W0130 06:10:54.920084 22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 7
W0130 06:10:54.920087 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 0
W0130 06:10:54.920092 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 1
W0130 06:10:54.920095 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 2
W0130 06:10:54.920099 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 4
W0130 06:10:54.920101 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 5
W0130 06:10:54.920104 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 6
W0130 06:10:54.920106 22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 7
W0130 06:10:54.920110 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 0
W0130 06:10:54.920117 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 1
W0130 06:10:54.920123 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 2
W0130 06:10:54.920127 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 3
W0130 06:10:54.920132 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 5
W0130 06:10:54.920135 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 6
W0130 06:10:54.920140 22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 7
W0130 06:10:54.920146 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 0
W0130 06:10:54.920152 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 1
W0130 06:10:54.920157 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 2
W0130 06:10:54.920164 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 3
W0130 06:10:54.920169 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 4
W0130 06:10:54.920176 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 6
W0130 06:10:54.920181 22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 7
W0130 06:10:54.920184 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 0
W0130 06:10:54.920190 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 1
W0130 06:10:54.920194 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 2
W0130 06:10:54.920200 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 3
W0130 06:10:54.920207 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 4
W0130 06:10:54.920212 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 5
W0130 06:10:54.920217 22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 7
W0130 06:10:54.920221 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 0
W0130 06:10:54.920228 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 1
W0130 06:10:54.920233 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 2
W0130 06:10:54.920238 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 3
W0130 06:10:54.920243 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 4
W0130 06:10:54.920254 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 5
W0130 06:10:54.920261 22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 6
W0130 06:11:12.578923 22 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
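Seeing "PaddlePaddle is installed successfully!" means Paddle is fully installed and working, with no remaining problems.
Along my road with Paddle, I found a fairly convenient way to install it, and I am sharing it here.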